Abstract

Most papers on high-dimensional statistics are based on the assumption that none
of the regressors are correlated with the regression error, namely, they are exogenous.
Yet, endogeneity arises easily in high-dimensional regression due to a large pool of 
regressors and this causes the inconsistency of the penalized least-squares methods 
and possible false scientific discoveries. A necessary condition for model selection of a 
very general class of penalized regression methods is given, which allows us to prove 
formally the inconsistency claim. To cope with the possible endogeneity, we construct 
a novel penalized focussed generalized method of moments (FGMM) criterion function 
and offer a new optimization algorithm. The FGMM is not a smooth function. To 
establish its asymptotic properties, we first study the model selection consistency and 
an oracle property for a general class of penalized regression methods. These results are 
then used to show that the FGMM possesses an oracle property even in the presence 
of endogenous predictors, and that the solution is also near global minimum under 
the over-identification assumption. Finally, we also show how the semi-parametric 
efficiency of estimation can be achieved via a two-step approach. 
1 Introduction



In recent years, ultra-high-dimensional models have gained considerable importance in
many fields of science, engineering and the humanities. In such models the overall number of
regressors $p$ grows extremely fast with the sample size $n$. In particular, $p = O(\exp(n^{\alpha}))$
for some $\alpha \in (0,1)$. Hence $p$ can grow non-polynomially with $n$, as in the so-called
NP-dimensional problem. Sparse modeling has been widely used to deal with high dimensionality
and "Big Data". For example, in the regression model

$$ Y = \mathbf{X}^T\beta_0 + \varepsilon, \tag{1.1} $$

it is assumed that most of the components of $\beta_0$ are zero, so that only a few regressors
are important and capture the main contributions to the regression. The goal of
ultra-high-dimensional modeling is to achieve the oracle property, which aims at (1) achieving
variable selection consistency (identifying the important regressors with high probability),
and (2) making inference on the coefficients of the important regressors. There is an
extensive literature addressing this problem (see, for example, Fan and Li (2001), Donoho
and Elad (2003), Donoho (2006), Zhao and Yu (2006), Candes and Tao (2007), Huang,
Horowitz and Ma (2008), Lounici (2008), Zhang and Huang (2008), Wasserman and Roeder
(2009), Lv and Fan (2009), Städler, Bühlmann and van de Geer (2010), Bühlmann, Kalisch
and Maathuis (2010), Belloni and Chernozhukov (2011b) and Raskutti, Wainwright and Yu
(2011)).

Has the goal of chasing the oracle really been achieved? While the majority of the
papers in the literature give various conditions under which the oracle property can be
achieved, they assume that all the candidate regressors are uncorrelated with the regression
error term, namely, $E(\varepsilon\mathbf{X}) = 0$. More stringently, they assume

$$ E(Y - \mathbf{X}^T\beta_0 \mid \mathbf{X}) = 0. \tag{1.2} $$

This is a very restrictive and possibly unrealistic assumption, yet it is hard, if not impossible,
to verify because of the high dimensionality $p$. Without this assumption, all popular model
selection techniques are inconsistent, as will be shown in Theorems 2.1 and 2.2, which can lead
to false scientific claims. Yet violations of assumption (1.2) arise easily as a result of selection
biases, measurement errors, autoregression with autocorrelated errors, omitted variables, and
many other sources (Engle, Hendry and Richard (1983)). In high-dimensional models,
this is even harder (if not impossible) to avoid because of the large collection of regressors. Indeed,
regressors are collected because of their possible predictive power for the response variable
$Y$. Yet requiring equation (1.2), or even more specifically

$$ E(Y - \mathbf{X}^T\beta_0)X_j = 0, \quad j = 1,\dots,p, \tag{1.3} $$

to hold is indeed a scientific fiction and an irresponsible assumption without any validation,
particularly when $p$ is large.

For example, in a wage equation, $Y$ is the logarithm of an individual's wage, and the
objects of interest in applications include the coefficients of $\mathbf{X}_S$, such as the years of education,
years of labor-force experience, marital status and labor union membership. On the other
hand, widely available data sets from the CPS (Current Population Survey) can contain hundreds
or even thousands of variables that are associated with wage but are unimportant predictors.
Some of these variables can nevertheless be correlated with $Y - \mathbf{X}^T\beta_0$ (namely, $\varepsilon$), owing to the
large pool of predictors. The analogy also applies to genomic applications, in which gene
expression profiles can be correlated with the regression errors, leading to the false selection
of irrelevant genes for scientific outcomes.

To solve the aforementioned issues, we borrow the terminology of endogeneity and
exogeneity from the econometric literature. A regressor is said to be endogenous when there is
a correlation between the regressor and the error term, and is said to be exogenous
otherwise. Broadly, a loop of causality between the dependent variable and a regressor can lead
to endogeneity (Verbeek (2008) and Hansen (2010)).

A more realistic and appealing model assumption should be:

$$ Y = \mathbf{X}^T\beta_0 + \varepsilon = \mathbf{X}_S^T\beta_{0S} + \varepsilon, \qquad E(Y - \mathbf{X}_S^T\beta_{0S} \mid \mathbf{X}_S) = 0, \tag{1.4} $$

where $\mathbf{X}_S$ and $\beta_{0S}$ denote the vector of important regressors and the corresponding coefficients
respectively, whose identities are, of course, unknown to us. This assumption is far easier to
validate. One of the goals of this paper is to achieve the oracle property under model (1.4)
in the presence of possibly endogenous regressors.

What makes model selection possible is the idea of over-identification. Let $S$ be the
set of important variables in model (1.4) and $|S|$ be the size of the set. For the set $S$, there
exists a solution to the over-identified equations (with respect to $\beta_S$) such as

$$ E(Y - \mathbf{X}_S^T\beta_S)\mathbf{X}_S = 0 \quad\text{and}\quad E(Y - \mathbf{X}_S^T\beta_S)\mathbf{X}_S^2 = 0, \tag{1.5} $$

where $\mathbf{X}_S^2$ is the vector consisting of the squared elements of $\mathbf{X}_S$ and is used as an illustration. It
can be replaced, for example, by $|\mathbf{X}_S|$ or many other functions of $\mathbf{X}_S$. In the above equations,
we have only $|S|$ unknowns, but $2|S|$ linear equations. Yet the solution exists and is given
by $\beta_S = \beta_{0S}$. On the other hand, for other sets $\tilde S$ of variables, the over-identified equations

$$ E(Y - \mathbf{X}_{\tilde S}^T\beta_{\tilde S})\mathbf{X}_{\tilde S} = 0 \quad\text{and}\quad E(Y - \mathbf{X}_{\tilde S}^T\beta_{\tilde S})\mathbf{X}_{\tilde S}^2 = 0 \tag{1.6} $$

do not have a compatible solution unless $\tilde S \supset S$, the support of $\beta_{\tilde S}$ is $S$, and

$$ E\varepsilon\mathbf{X}_{\tilde S} = 0 \quad\text{and}\quad E\varepsilon\mathbf{X}_{\tilde S}^2 = 0, \tag{1.7} $$

where $\varepsilon = Y - \mathbf{X}_S^T\beta_{0S}$.

We show that in the presence of endogenous regressors, the classical penalized least
squares method is no longer consistent. Under model (1.4), we introduce a novel loss function,
called focussed generalized method of moments (FGMM), which differs from the classical
generalized method of moments (Hansen, 1982) in that the instrumental variables depend
irregularly on unknown parameters. The new FGMM fully exploits the information
contained in the moment condition (1.4), and is powerful in detecting incorrectly specified
moment conditions of the form

$$ E(Y - \mathbf{X}_S^T\beta_{0S})X_l = 0 $$

if $X_l$ is endogenous. It is also very different from the low-dimensional techniques of either
moment selection (Andrews 1999, Andrews and Lu 2001) or shrinkage GMM (Liao 2010) in
dealing with misspecification of moment conditions; the latter introduces one unknown
parameter for each possibly misspecified equation and is inappropriate in our high-dimensional
setting. However, penalization is still needed in FGMM to avoid overfitting the model,
since we allow some of the unimportant predictors to be exogenous, satisfying (1.7). This results in a
novel penalized FGMM. The proposed FGMM successfully achieves the oracle property in the
presence of endogeneity. In particular, the estimator converges in probability to $\beta_{0S}$ at the
near-oracle rate $O_p(\sqrt{(s\log s)/n})$ (Fan and Lv (2011)), and under a certain over-identification
condition, is a near global minimizer. In addition, it is shown that via a two-step procedure
similar to ISIS (Fan and Lv, 2008) and post-lasso (Belloni and Chernozhukov, 2011a), we
can achieve semi-parametric efficiency in a more general nonlinear model.

In addition, we consider a more general framework for the ultra-high-dimensional variable
selection problem, and derive both sufficient and necessary conditions for a penalized
minimization procedure to achieve the oracle property, where both the loss function (the leading
term of the criterion function) and the penalty function can take a very general form. Many
results on the oracle property in the literature can be understood as applications of these
general theorems.

We emphasize that the problem addressed in this paper is not a simple model
misspecification, but rather a question of which kinds of model assumptions are more realistic, and
of which assumptions empirical researchers feel comfortable with.

The remainder of this paper is organized as follows. Section 2 gives a necessary condition for a
general penalized regression to achieve the oracle property. We also show that in the
presence of endogenous regressors, the penalized least squares method is inconsistent. Section 3
constructs a penalized FGMM to solve the problem of endogeneity, and discusses the
rationale of our construction as well as its numerical implementation. Section 4 gives sufficient
conditions for establishing the oracle property for a general penalized regression. Section
5 applies these conditions to show the oracle property of FGMM. Section 6 discusses
global optimization. Section 7 is concerned with the semi-parametric efficient estimation of
the non-vanishing parameters. Simulation results are presented in Section 8. Finally,
Section 9 concludes. Proofs are given in the appendix.



Notation

Throughout the paper, let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ be the smallest and largest eigenvalues
of a square matrix $A$. We denote by $\|A\|$, $\|A\|_2$ and $\|A\|_\infty$ the Frobenius, operator and
elementwise norms of a matrix $A$, defined respectively as $\|A\| = \mathrm{tr}^{1/2}(A^TA)$,
$\|A\|_2 = \lambda_{\max}^{1/2}(A^TA)$ and $\|A\|_\infty = \max_{i,j}|A_{ij}|$. When $A$ is a vector, both $\|A\|$ and $\|A\|_2$ are
equal to the Euclidean norm. For two sequences $a_n$ and $b_n \neq 0$, write $a_n \ll b_n$ (equivalently,
$b_n \gg a_n$) if $a_n = o(b_n)$. $|\beta|_0$ denotes the number of nonzero components of a vector $\beta$. In
addition, $P_n'(t)$ and $P_n''(t)$ denote the first and second derivatives of a penalty function $P_n(t)$.
Finally, we write w.p.a.1 as shorthand for "with probability approaching one".

2 Necessary Condition for Variable Selection Consistency

2.1 Penalized regression and necessary condition

Let $s$ denote the number of nonzero coefficients of $\beta_0$. For notational simplicity and without
loss of generality, it is assumed throughout the paper that the coordinates are rearranged
so that the non-vanishing coordinates of $\beta_0$ are the first $s$ coordinates, denoted by $\beta_{0S}$.
Therefore, the true structural parameter can be partitioned as $\beta_0 = (\beta_{0S}^T, \beta_{0N}^T)^T$ with
$\beta_{0N} = 0$. Accordingly, the regressors can be partitioned as $\mathbf{X} = (\mathbf{X}_S^T, \mathbf{X}_N^T)^T$, called important
regressors and unimportant regressors respectively. The sparsity structure typically assumes
that the number of important regressors $s = \dim(\mathbf{X}_S)$ grows slowly with the sample size:
$s = o(n)$.
A penalized regression problem in general takes the form

$$ \min_{\beta}\; L_n(\beta) + \|P_n(\beta)\|_1, $$

where $P_n(\cdot)$ denotes a penalty function and $\|P_n(\beta)\|_1 = \sum_{j=1}^p P_n(|\beta_j|)$. While the current
literature has focused on sufficient conditions for the penalized estimator to achieve
the oracle property, much less attention has been paid to necessary conditions. Zhao
and Yu (2006) derived an almost necessary condition for sign consistency. Zou (2006)
provided a necessary condition for the variable selection consistency of the least squares
estimator with the Lasso penalty when $p/n \to 0$. To the authors' best knowledge, so far there
has been no necessary condition on the loss function for selection consistency in the
ultra-high-dimensional framework. Such a necessary condition is important, because it provides
a way to judge whether a given loss function can result in consistent variable selection.

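Theorem 2.1 (Necessary Condition). Suppose:
(i) $L_n(\beta)$ is twice differentiable, and

$$ \max_{1 \le i,j \le p} \left| \frac{\partial^2 L_n(\beta_0)}{\partial\beta_i\,\partial\beta_j} \right| = O_p(1). $$

(ii) There is a local minimizer $\hat\beta = (\hat\beta_S^T, \hat\beta_N^T)^T$ of

$$ L_n(\beta) + \|P_n(\beta)\|_1 $$

such that $P(\hat\beta_N = 0) \to 1$, and $\sqrt{s}\,\|\hat\beta - \beta_0\| = o_p(1)$.

(iii) The penalty satisfies: $P_n(\cdot) \ge 0$, $P_n(0) = 0$, $P_n'(t)$ is non-increasing when $t \in (0, M)$ for
some $M > 0$, and $\lim_{n\to\infty}\lim_{t\to 0^+} P_n'(t) = 0$.
Then for any $l$ such that $\beta_{0l} = 0$,

$$ \left| \frac{\partial L_n(\beta_0)}{\partial\beta_l} \right| \xrightarrow{p} 0. \tag{2.1} $$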



Note that the conclusion (2.1) differs from the Karush-Kuhn-Tucker (KKT) condition in
that it is about the gradient vector evaluated at the true parameters rather than at the local
minimizer. The conditions on the penalty function in (iii) are very general, and are satisfied
by a large class of popular penalties, such as Lasso (Tibshirani 1996), SCAD (Fan and Li
2001) and MCP (Zhang 2009), as long as the tuning parameter $\lambda_n \to 0$. Hence this theorem
should be understood as a necessary condition imposed on the loss function instead of the
penalty.
2.2 Inconsistency of least squares with endogeneity

As an important application of Theorem 2.1, consider the simple linear model:

$$ Y = \mathbf{X}^T\beta_0 + \varepsilon = \mathbf{X}_S^T\beta_{0S} + \varepsilon, \tag{2.2} $$

where $E(\varepsilon \mid \mathbf{X}_S) = 0$. However, we may not have $E(\varepsilon \mid \mathbf{X}) = 0$.

The conventional penalized least squares (PLS) problem is defined as:

$$ \min_{\beta}\; \frac{1}{n}\sum_{i=1}^n (Y_i - \mathbf{X}_i^T\beta)^2 + \|P_n(\beta)\|_1. $$

In the simpler case when $s$, the number of non-vanishing components of $\beta_0$, is bounded, it
can be shown that if there exists some unimportant regressor correlated with the regression
error $\varepsilon$, then PLS does not achieve variable selection consistency. This is because the
necessary condition in (2.1) does not hold for the least squares loss function. Hence, without
the ad hoc exogeneity assumption, PLS no longer works.
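
To see why the necessary condition fails, it may help to write out the gradient of the least squares loss at the true parameter. The following is only a sketch of that step, assuming i.i.d. observations with finite second moments so that the law of large numbers applies; here $X_{il}$ denotes the $l$-th regressor of observation $i$ and $\varepsilon_i = Y_i - \mathbf{X}_i^T\beta_0$.

```latex
L_n(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - \mathbf{X}_i^T\beta)^2
\quad\Longrightarrow\quad
\frac{\partial L_n(\beta_0)}{\partial \beta_l}
  = -\frac{2}{n}\sum_{i=1}^n X_{il}\,\varepsilon_i
  \;\xrightarrow{p}\; -2\,E(X_l\,\varepsilon).
```

If $X_l$ is an unimportant regressor with $|E(X_l\varepsilon)| > c$ for some $c > 0$, the limit is bounded away from zero, so the gradient condition (2.1) cannot hold for the least squares loss.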

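Theorem 2.2 (Inconsistency of PLS). Suppose $s = O(1)$, and $\mathbf{X}_N$ has an endogenous
component $X_l$, that is, $|E(X_l\varepsilon)| > c$ for some $c > 0$. Assume that $EX_l^4 < \infty$, $E\varepsilon^4 < \infty$,
and $P_n(t)$ satisfies the conditions in Theorem 2.1. If $\tilde\beta = (\tilde\beta_S^T, \tilde\beta_N^T)^T$,
corresponding to the coefficients of $(\mathbf{X}_S, \mathbf{X}_N)$, is a local minimizer of

$$ \frac{1}{n}\sum_{i=1}^n (Y_i - \mathbf{X}_i^T\beta)^2 + \|P_n(\beta)\|_1, $$

then either $\|\tilde\beta_S - \beta_{0S}\| \not\xrightarrow{p} 0$, or

$$ \limsup_{n\to\infty} P(\tilde\beta_N = 0) < 1. $$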

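We have conducted a simple simulation experiment to illustrate the impact of endogeneity
on variable selection. Consider

$$ Y = \mathbf{X}^T\beta_0 + \varepsilon, \qquad \varepsilon \sim N(0,1), $$
$$ \beta_{0S} = (5, -4, 7, -1, 1.5); \qquad \beta_{0j} = 0 \ \text{for}\ 6 \le j \le p, $$
$$ X_j = Z_j \ \text{for}\ j \le 5, \qquad X_j = (Z_j + 5)(\varepsilon + 1) \ \text{for}\ 6 \le j \le p, $$
$$ Z \sim N_p(0, \Sigma), \ \text{independent of}\ \varepsilon, \ \text{with}\ (\Sigma)_{ij} = 0.5^{|i-j|}. $$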
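
The design above can be reproduced in a few lines. The following is a minimal sketch, not the authors' simulation code: the function name `simulate_design`, the seed, and the use of NumPy are assumptions made here for illustration. It also reports the sample analogue of $E(X_j\varepsilon)$, whose population value under this design is $0$ for $j \le 5$ and $E(Z_j + 5)\,E(\varepsilon^2 + \varepsilon) = 5$ for $j \ge 6$.

```python
import numpy as np

def simulate_design(n=300, p=50, seed=0):
    """Draw one sample from the endogenous design described in the text."""
    rng = np.random.default_rng(seed)  # seed is an arbitrary choice
    beta0 = np.zeros(p)
    beta0[:5] = [5.0, -4.0, 7.0, -1.0, 1.5]
    # Covariance of Z: Sigma_ij = 0.5^{|i-j|}
    idx = np.arange(p)
    sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    z = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    eps = rng.standard_normal(n)
    x = z.copy()
    # Regressors 6..p are built from the error term, hence endogenous.
    x[:, 5:] = (z[:, 5:] + 5.0) * (eps[:, None] + 1.0)
    y = x @ beta0 + eps
    return x, y, eps, beta0

x, y, eps, beta0 = simulate_design()
# Sample analogue of E(X_j * eps) for each regressor.
moments = (x * eps[:, None]).mean(axis=0)
print("important regressors:  ", np.round(moments[:5], 2))
print("unimportant regressors:", np.round(moments[5:8], 2))
```

The first five sample moments should be close to zero, while the remaining ones are far from zero, which is exactly the violation of (1.3) that makes least squares select the unimportant regressors.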
Table 1: Performance of PLS and FGMM over 100 replications, p = 50, n = 300

            PLS                                     FGMM
λ           0.05      0.1       0.5       1         0.05      0.1       0.2       0.4
MSE_S       0.145     0.133     0.629     1.417     0.261     0.184     0.194     0.979
            (0.053)   (0.043)   (0.301)   (0.329)   (0.094)   (0.069)   (0.076)   (0.245)
MSE_N       0.126     0.068     0.072     0.095     0.001     0         0.001     0.003
            (0.035)   (0.016)   (0.016)   (0.019)   (0.010)   (0)       (0.009)   (0.014)
TP-Mean     5         5         4.82      3.63      5         5         5         4.5
            (0)       (0)       (0.385)   (0.504)   (0)       (0)       (0)       (0.503)
FP-Mean     37.68     35.36     8.84      2.58      0.08      0         0.02      0.14
            (2.902)   (3.045)   (3.334)   (1.557)   (0.337)   (0)       (0.141)   (0.569)

MSE_S is the average of $\|\hat\beta_S - \beta_{0S}\|$ for the non-vanishing coefficients. MSE_N is the average of
$\|\hat\beta_N - \beta_{0N}\|$ for the zero coefficients. TP is the number of correctly selected variables, and FP
is the number of incorrectly selected variables. The standard error of each measure is reported
in parentheses.



In the design, the unimportant regressors are endogenous. Penalized least squares
(PLS) with the SCAD penalty was used for variable selection. As Table 1 shows, PLS selects
many unimportant regressors (FP-Mean). In contrast, the proposed penalized
FGMM (to be introduced) does an excellent job both in selecting the important
regressors and in eliminating the unimportant ones. Yet the inefficiency of $\hat\beta_S$ by FGMM is
due to the moment conditions used in the estimation. This can be improved further, as shown
in Section 7.

3 Focussed GMM 
3.1 Definition 

Instead of the linear regression (1.1), in this paper we will consider a more general
framework:

$$ E[g(Y, \mathbf{X}_S^T\beta_{0S}) \mid \mathbf{X}_S] = 0, \tag{3.1} $$

where $Y$ stands for the dependent variable and $g : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a known function. For
simplicity, we require that $g$ be one-dimensional; it should be thought of as a possibly
nonlinear residual function. Our results can be naturally extended to multi-dimensional
conditional moment restrictions.
Model (3.1) is called a conditional moment restricted model, which has been extensively
studied in the literature: Newey (1993), Donald, Imbens and Newey (2003), Kitamura,
Tripathi and Ahn (2004), etc. Some interesting examples in the generalized linear
model that fit into (3.1) are:

• simple linear regression, $g(t_1, t_2) = t_1 - t_2$;

• logit model, $g(t_1, t_2) = t_1 - \exp(t_2)/(1 + \exp(t_2))$;

• probit model, $g(t_1, t_2) = t_1 - \Phi(t_2)$, where $\Phi(\cdot)$ denotes the standard normal cumulative
distribution function.
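
The three residual functions listed above are simple enough to write down directly. The sketch below is illustrative only; the function names are hypothetical, and $\Phi$ is computed from the error function in the Python standard library.

```python
import math

def g_linear(t1, t2):
    """Linear regression residual: t1 - t2."""
    return t1 - t2

def g_logit(t1, t2):
    """Logit residual: response minus the logistic mean."""
    return t1 - math.exp(t2) / (1.0 + math.exp(t2))

def g_probit(t1, t2):
    """Probit residual: response minus Phi(t2), with Phi computed via erf."""
    phi = 0.5 * (1.0 + math.erf(t2 / math.sqrt(2.0)))
    return t1 - phi
```

In each case the first argument plays the role of $Y$ and the second that of the linear index $\mathbf{X}_S^T\beta_{0S}$.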

The conditional moment restriction (3.1) implies that

$$ E[g(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{X}_S] = 0 \quad\text{and}\quad E[g(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{X}_S^2] = 0, \tag{3.2} $$

where $\mathbf{X}_S^2$ denotes the vector of coordinate-wise squares of $\mathbf{X}_S$ and can be replaced by
any other nonlinear function such as $|\mathbf{X}_S|$ (assuming each variable has mean 0). A typical
estimator based on moment conditions like (3.2) can be obtained via the generalized method
of moments (GMM, Hansen 1982). However, in the problem considered here, (3.2) cannot
be used directly to construct the GMM criterion function since the true identities of $\mathbf{X}_S$
are unknown to us. On the other hand, as explained in the introduction, the over-identified
equations (3.2) do not have a solution for other sets that support $\beta$.

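To take advantage of the above intuition, let us introduce some additional notation. For
any $\beta \in \mathbb{R}^p \setminus \{0\}$ and $i = 1,\dots,n$, define the $r = |\beta|_0$-dimensional vectors

$$ \mathbf{X}_i(\beta) = (X_{il_1},\dots,X_{il_r})^T \quad\text{and}\quad \mathbf{X}_i^2(\beta) = (X_{il_1}^2,\dots,X_{il_r}^2)^T, $$

where $(l_1,\dots,l_r)$ denote the indices of the non-vanishing components of $\beta$. For example, if
$p = 3$ and $\beta = (1,0,2)^T$, then $\mathbf{X}_i(\beta) = (X_{i1}, X_{i3})^T$ and $\mathbf{X}_i^2(\beta) = (X_{i1}^2, X_{i3}^2)^T$, $i \le n$.

The FGMM weight matrix is specified as follows: for each $j = 1,\dots,p$, let $\bar X_j = \frac{1}{n}\sum_{i=1}^n X_{ij}$ and
$\overline{X_j^2} = \frac{1}{n}\sum_{i=1}^n X_{ij}^2$, and define

$$ \widehat{\mathrm{var}}(X_j) = \frac{1}{n}\sum_{i=1}^n (X_{ij} - \bar X_j)^2, \qquad \widehat{\mathrm{var}}(X_j^2) = \frac{1}{n}\sum_{i=1}^n (X_{ij}^2 - \overline{X_j^2})^2, $$

which are the sample variances of $X_j$ and $X_j^2$ respectively. The $(2|\beta|_0) \times (2|\beta|_0)$ FGMM
weight matrix is given by the diagonal matrix

$$ W(\beta) = \mathrm{diag}\{\widehat{\mathrm{var}}(X_{l_1})^{-1},\dots,\widehat{\mathrm{var}}(X_{l_r})^{-1}, \widehat{\mathrm{var}}(X_{l_1}^2)^{-1},\dots,\widehat{\mathrm{var}}(X_{l_r}^2)^{-1}\}, $$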
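where, again, $(l_1,\dots,l_r)$ denote the indices of the non-vanishing components of $\beta$.
Let

$$ \mathbf{V}_i(\beta) = (\mathbf{X}_i(\beta)^T, \mathbf{X}_i^2(\beta)^T)^T. $$

Our focussed generalized method of moments (FGMM) loss function is defined as

$$ L_{\mathrm{FGMM}}(\beta) = \sum_{j=1}^p I_{(\beta_j \neq 0)} \left\{ \frac{1}{\widehat{\mathrm{var}}(X_j)} \left[ \frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta) X_{ij} \right]^2 + \frac{1}{\widehat{\mathrm{var}}(X_j^2)} \left[ \frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta) X_{ij}^2 \right]^2 \right\} $$
$$ \phantom{L_{\mathrm{FGMM}}(\beta)} = \left[ \frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta)\mathbf{V}_i(\beta) \right]^T W(\beta) \left[ \frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta)\mathbf{V}_i(\beta) \right]. $$

The loss function is a weighted average of the two quadratic terms
$\left(\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta)X_{ij}\right)^2$ and $\left(\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta)X_{ij}^2\right)^2$. In the same spirit as the regular
GMM's optimal weight matrix, the weights depend on the variances of the instrumental
variables $\mathbf{X}(\beta)$ and $\mathbf{X}^2(\beta)$, and help to standardize the moment conditions.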

The term $\mathbf{X}^2(\beta)$ is used here as an example. Other instrumental variables $\mathbf{V}_i(\beta)$ can
also be used. An obvious example is to replace $\mathbf{X}^2(\beta)$ by $|\mathbf{X}(\beta) - \bar{\mathbf{X}}(\beta)|$, in which $\bar{\mathbf{X}}(\beta)$ is
the sample mean vector of $\mathbf{X}(\beta)$. Unlike in traditional GMM, the instrumental variables
$\mathbf{V}_i(\beta)$ depend on the unknown $\beta$ and are not continuous in $\beta$. As will be further explained
below, this allows us to focus only on the equations with correct specifications; the method is therefore
called the focussed GMM, or FGMM for short. We then define the FGMM estimator by
minimizing the following criterion function:

$$ Q_{\mathrm{FGMM}}(\beta) = L_{\mathrm{FGMM}}(\beta) + \|P_n(\beta)\|_1. \tag{3.3} $$

The penalty function $\|P_n(\beta)\|_1$ is also needed, because the indicator function in $L_{\mathrm{FGMM}}$ itself
only plays the role of sure screening, which is not enough to guarantee variable selection
consistency. Sufficient conditions on the penalty function for the oracle property will be
presented in Section 4.
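
The definition of the FGMM loss can be made concrete with a short numerical sketch for the linear residual $g(t_1, t_2) = t_1 - t_2$. This is an illustration under assumptions, not the authors' implementation: the function name `fgmm_loss`, the toy data, and the convention of returning zero at $\beta = 0$ (a point excluded from the definition above) are all choices made here.

```python
import numpy as np

def fgmm_loss(beta, x, y):
    """FGMM loss for the linear residual g(t1, t2) = t1 - t2.

    Only coordinates with beta_j != 0 contribute the instruments X_j and X_j^2.
    """
    beta = np.asarray(beta, dtype=float)
    support = np.flatnonzero(beta)
    if support.size == 0:
        return 0.0  # convention for beta = 0, which the definition excludes
    resid = y - x @ beta
    xs = x[:, support]
    v = np.hstack([xs, xs ** 2])           # V_i(beta), shape (n, 2 * |beta|_0)
    w = 1.0 / v.var(axis=0)                # diagonal of W(beta), 1/n sample variances
    m = (resid[:, None] * v).mean(axis=0)  # sample moments (1/n) sum g_i V_i
    return float(m @ (w * m))

# Hypothetical toy data: three exogenous regressors, only the first one matters.
rng = np.random.default_rng(1)
x = rng.standard_normal((2000, 3))
y = 2.0 * x[:, 0] + rng.standard_normal(2000)
loss_true = fgmm_loss([2.0, 0.0, 0.0], x, y)   # correct support and coefficient
loss_wrong = fgmm_loss([0.0, 0.0, 2.0], x, y)  # wrong support
print(loss_true, loss_wrong)
```

At the true parameter the sample moments are close to zero, so the loss is small; on a wrong support the moment involving the selected regressor stays far from zero.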
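3.2 Rationales behind the construction of FGMM

3.2.1 Inclusion of V(β)

We construct the FGMM criterion function using

$$ \mathbf{V}(\beta) = (\mathbf{X}(\beta)^T, \mathbf{X}^2(\beta)^T)^T. $$

A natural question arises: including $\mathbf{X}^2(\beta)$ seems ad hoc; why not just use $\mathbf{V}(\beta) = \mathbf{X}(\beta)$?
We now explain the rationale behind the inclusion of a term such as $\mathbf{X}^2(\beta)$.

Let us consider the linear regression model (1.4) as an example. If $\mathbf{X}^2(\beta)$ were not included
and $\mathbf{V}(\beta) = \mathbf{X}(\beta)$ had been used, the GMM loss function would have been constructed as

$$ \left[ \frac{1}{n}\sum_{i=1}^n (Y_i - \mathbf{X}_i^T\beta)\mathbf{X}_i(\beta) \right]^T W(\beta) \left[ \frac{1}{n}\sum_{i=1}^n (Y_i - \mathbf{X}_i^T\beta)\mathbf{X}_i(\beta) \right]. $$

For simplicity of illustration, we assume that $W(\beta)$ is the identity matrix, and use the $l_0$
penalty $P_n(|\beta_j|) = \lambda_n I_{(|\beta_j| \neq 0)}$.

Suppose that the true $\beta_0 = (\beta_{0S}^T, 0, \dots, 0)^T$, where only the first $s$ components are
non-vanishing, and that $s > 1$. If we, however, restrict ourselves to $\beta_p = (0, \dots, 0, \beta_p)$, the criterion
function now becomes

$$ Q_{\mathrm{FGMM}}(\beta_p) = \left[ \frac{1}{n}\sum_{i=1}^n (Y_i - X_{ip}\beta_p) X_{ip} \right]^2 + \lambda_n. $$

It is easy to see that its minimum is just $\lambda_n$ under mild conditions, although $\beta_{0,p} = 0$. On the
other hand, if we optimize $Q_{\mathrm{FGMM}}$ on the true parameter space $\beta = (\beta_S^T, 0)^T$, with all components
of $\beta_S$ nonzero, then

$$ \min_{\beta = (\beta_S^T, 0)^T} Q_{\mathrm{FGMM}}(\beta) = \min_{\beta = (\beta_S^T, 0)^T} L_{\mathrm{FGMM}}(\beta) + s\lambda_n \ge s\lambda_n. $$

As a result, minimizing $Q_{\mathrm{FGMM}}$ is inconsistent for variable selection.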

Including an additional term $\mathbf{X}^2(\beta)$ in $\mathbf{V}(\beta)$ can overcome this problem. Since the
number of equations in

$$ E[(Y - \mathbf{X}^T\beta)\mathbf{X}(\beta)] = 0 \quad\text{and}\quad E[(Y - \mathbf{X}^T\beta)\mathbf{X}^2(\beta)] = 0 \tag{3.4} $$

is twice the number of unknowns (the non-vanishing components of $\beta$), it is very
unlikely that some $\beta$ other than $\beta_0$ satisfies (3.4). As a result, if we define

$$ G(\beta) = \|E(Y - \mathbf{X}^T\beta)\mathbf{X}(\beta)\|^2 + \|E(Y - \mathbf{X}^T\beta)\mathbf{X}^2(\beta)\|^2, $$

the population version of $L_{\mathrm{FGMM}}$, then as long as $\beta$ is not close to $\beta_0$, $G$ should be bounded
away from zero. Therefore, it is reasonable for us to assume that for any $\epsilon > 0$,

$$ \inf_{\|\beta - \beta_0\|_\infty > \epsilon,\ \beta \neq 0} G(\beta) > \delta \tag{3.5} $$

for some $\delta > 0$. Because of condition (3.5) and the fact that $G(\beta_0) = 0$, implied by the model assumption
$E(Y - \mathbf{X}_S^T\beta_{0S} \mid \mathbf{X}_S) = 0$, minimizing $L_{\mathrm{FGMM}}$ forces the estimator to be close to $\beta_0$.

Instead of $\mathbf{X}^2(\beta)$, one can include other transformations of $\mathbf{X}(\beta)$, such
as trigonometric functions, in $\mathbf{V}(\beta)$ to construct FGMM, as long as

$$ \inf_{\|\beta - \beta_0\|_\infty > \epsilon,\ \beta \neq 0} \|E g(Y, \mathbf{X}^T\beta)\mathbf{V}(\beta)\|^2 > \delta. $$

The specific choice of $\mathbf{V}(\beta)$ does not affect the oracle property; it matters only for the
asymptotic variance of the estimator (see Sections 5 and 7 for details).
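
The role of the extra instrument can be checked numerically. The sketch below uses a hypothetical design chosen here for illustration (two independent exponential regressors, of which only the first enters the model); it is not taken from the paper. With $X_2$ alone as instrument, a wrong one-variable model can drive the single moment to zero, whereas adding $X_2^2$ leaves the weighted loss bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
# Hypothetical skewed design (an assumption of this sketch): X_1, X_2 ~ Exp(1).
x = rng.exponential(1.0, size=(n, 2))
y = 3.0 * x[:, 0] + rng.standard_normal(n)  # only X_1 matters

def loss_wrong_support(b2, use_square):
    """Weighted moment loss when beta is (wrongly) supported on X_2 alone."""
    resid = y - b2 * x[:, 1]
    instruments = [x[:, 1]]
    if use_square:
        instruments.append(x[:, 1] ** 2)
    return sum(np.mean(resid * v) ** 2 / np.var(v) for v in instruments)

# Crude grid search over the single free coefficient.
grid = np.linspace(0.5, 2.5, 401)
min_x_only = min(loss_wrong_support(b, False) for b in grid)
min_with_square = min(loss_wrong_support(b, True) for b in grid)
print(min_x_only, min_with_square)
```

In this toy setting the exactly identified criterion is essentially zero at some wrong coefficient, while the over-identified one is not, which is the behaviour that condition (3.5) formalizes. Note that the squared instrument is informative here because the regressors are skewed; for mean-zero symmetric regressors another transformation may be needed.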



3.2.2 Indicator function 

We handle the problems of ultra-high dimensionality and model misspecification
simultaneously by including an indicator function $I_{(\beta_j \neq 0)}$ in the loss function. As a result, the
instrumental variables $\mathbf{V}(\beta)$ depend on the parameter $\beta$, which leads to the novel focussed
GMM. We now explain the rationale behind it.

Recently, there has been a growing literature on shrinkage GMM, e.g., Caner (2009),
Caner and Zhang (2009), etc., regarding estimation and variable selection based on a set of
moment conditions like (3.2). The model considered by these authors, besides being restricted
to specific penalty functions, differs significantly from ours in that the moment conditions
they considered are all correctly specified. More recently, Liao (2010) considered GMM with
misspecified moment conditions, but in a low-dimensional parameter space, and used a very
different idea.

In contrast, because we allow the presence of possibly endogenous regressors, the moment
conditions of the form

$$ E[g(Y, \mathbf{X}^T\beta_0)\mathbf{X}] = 0 $$

are subject to misspecification on some endogenous regressors. While only the important
regressors are assumed to satisfy

$$ E[g(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{X}_S] = 0 \quad\text{and}\quad E[g(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{X}_S^2] = 0, $$

the identities of the correct moment conditions are unknown to us. Without the indicator
function in the definition of $L_{\mathrm{FGMM}}(\beta)$, the oracle estimator can still have a large objective
value because of the endogeneity of other predictors. Therefore the oracle estimator is not
necessarily the minimizer.

Including the indicator function in $L_{\mathrm{FGMM}}(\beta)$ eliminates the endogenous regressors. In
addition, it automatically performs a sure-screening procedure that produces a sparse
solution. Unless the support $S(\beta)$ of $\beta$ contains the true variables in $S$, $L_{\mathrm{FGMM}}(\beta)$ is large.
Among those $S(\beta) \supset S$, some variables can be exogenous, satisfying (1.7). Zero or small
coefficients on such variables are allowable when only $L_{\mathrm{FGMM}}(\beta)$ is minimized without a
penalty, whereas the penalty term in (3.3) makes this choice infeasible.

3.3 Implementation 

We now discuss the implementation for numerically minimizing the penalized FGMM 
criterion function. 

3.3.1 Smoothed FGMM 

As we discussed above, including an indicator function benefits us greatly in dimension
reduction as well as in handling endogeneity. However, it also makes $L_{\mathrm{FGMM}}$ non-smooth. For
each fixed subset $\tilde S \subset \{1,\dots,p\}$, this criterion function is continuous in $\beta$ on
$\{\beta \in \mathbb{R}^p : \beta_j = 0 \text{ if } j \notin \tilde S\}$, but is not continuous in $\beta$ globally on $\mathbb{R}^p$. As there are $2^p$ subsets of $\{1,\dots,p\}$,
minimizing $Q_{\mathrm{FGMM}}(\beta) = L_{\mathrm{FGMM}}(\beta) + \text{Penalty}$ is generally NP-hard; that is, no
polynomial-time algorithm for the problem is known.

We overcome this discontinuity problem by applying the smoothing technique of
Horowitz (1992), which approximates the indicator function by a smooth kernel
$K : (-\infty, \infty) \to \mathbb{R}$ that satisfies

1. $0 \le K(t) < M$ for some finite $M$ and all $t \ge 0$.

2. $K(0) = 0$ and $\lim_{|t|\to\infty} K(t) = 1$.

3. $\limsup_{|t|\to\infty} |K'(t)t| = 0$, and $\limsup_{|t|\to\infty} |K''(t)t^2| < \infty$.

We can set $K(t) = \frac{F(t) - F(0)}{1 - F(0)}$, where $F(t)$ is a twice differentiable cumulative distribution
function. For a pre-determined small number $h_n$, $L_{\mathrm{FGMM}}$ is approximated by a continuous
function in $\beta$:

$$ L_K(\beta) = \sum_{j=1}^p K\!\left(\frac{\beta_j^2}{h_n}\right) \left\{ \frac{1}{\widehat{\mathrm{var}}(X_j)} \left[ \frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta) X_{ij} \right]^2 + \frac{1}{\widehat{\mathrm{var}}(X_j^2)} \left[ \frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta) X_{ij}^2 \right]^2 \right\}. $$

Note that as $h_n \to 0^+$, $K(\beta_j^2/h_n)$ converges to $I_{(\beta_j \neq 0)}$, and hence $L_K(\beta)$ is simply a
smoothed version of $L_{\mathrm{FGMM}}(\beta)$ in finite samples. As an illustration, Figure 1 plots $K(t^2/h_n)$
as a function of $t$ using the logistic cumulative distribution function, where

$$ K\!\left(\frac{t^2}{h_n}\right) = \frac{\exp(t^2/h_n) - 1}{\exp(t^2/h_n) + 1}. $$

Figure 1: $K(t^2/h_n) = \frac{\exp(t^2/h_n) - 1}{\exp(t^2/h_n) + 1}$ as an approximation to $I_{(t \neq 0)}$
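
The logistic smoothing kernel is easy to evaluate numerically. The sketch below is illustrative (the function name `smooth_indicator` is hypothetical); it uses the identity $(e^u - 1)/(e^u + 1) = \tanh(u/2)$ so that large values of $t^2/h_n$ do not overflow.

```python
import math

def smooth_indicator(t, h):
    """Logistic-based K(t^2 / h), a smooth stand-in for the indicator I(t != 0)."""
    # tanh(u / 2) equals (exp(u) - 1) / (exp(u) + 1) and is safe for large u.
    return math.tanh(t * t / (2.0 * h))

# As h shrinks, the value at a fixed nonzero t approaches 1,
# while the value at t = 0 stays exactly 0.
for h in (1.0, 0.1, 0.01):
    print(h, smooth_indicator(0.0, h), smooth_indicator(0.3, h))
```

Smaller values of the bandwidth give a sharper approximation to the indicator, at the price of a steeper and numerically harder objective near zero.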




3.3.2 Coordinate descent algorithm 

After smoothing the indicator function by a kernel $K(\cdot)$, we employ the iterative
coordinate algorithm for the FGMM minimization, which was used by Fu (1998), Daubechies et al.
(2004), Fan and Lv (2011), etc. The iterative coordinate algorithm minimizes over one
coordinate of $\beta$ at a time, with the other coordinates kept fixed at their values obtained from previous
steps, and successively updates each coordinate. The penalty function can be approximated
by LLA (local linear approximation) as in Zou and Li (2008).

Specifically, we run the regular penalized least squares to obtain an initial value, from
which we start the iterative coordinate algorithm for the FGMM minimization. Suppose
$\beta^{(l)}$ is obtained at step $l$. For $k \in \{1,\dots,p\}$, denote by $\beta^{(l)}_{(-k)}$ the $(p-1)$-dimensional vector
consisting of all the components of $\beta^{(l)}$ but $\beta^{(l)}_k$. Write $(\beta^{(l)}_{(-k)}, t)$ for the $p$-dimensional vector
that replaces $\beta^{(l)}_k$ with $t$. The minimization with respect to $t$ while keeping $\beta^{(l)}_{(-k)}$ fixed is then
a univariate minimization problem, which can be carried out by a golden section search. To
speed up the convergence, we can also use the second order approximation of $L_K(\beta^{(l)}_{(-k)}, t)$
along the $k$th component:

$$ L_K(\beta^{(l)}_{(-k)}, t) \approx L_K(\beta^{(l)}) + \frac{\partial L_K(\beta^{(l)})}{\partial\beta_k}(t - \beta^{(l)}_k) + \frac{1}{2}\frac{\partial^2 L_K(\beta^{(l)})}{\partial\beta_k^2}(t - \beta^{(l)}_k)^2 \equiv L_K(\beta^{(l)}) + \hat L_K(\beta^{(l)}_{(-k)}, t). \tag{3.6} $$

We solve for

$$ t^* = \arg\min_t\; \hat L_K(\beta^{(l)}_{(-k)}, t) + P_n'(|\beta^{(l)}_k|)\,|t|, \tag{3.7} $$

which admits an explicit analytical solution. We keep the remaining components at step $l$.
We accept $t^*$ as an updated $k$th component of $\beta^{(l)}$ only if $L_K(\beta^{(l)}) + \sum_{j=1}^p P_n(|\beta^{(l)}_j|)$ strictly
decreases.

The algorithm runs as follows.

1. Set $l = 1$. Initialize $\beta^{(1)} = \hat\beta^*$, where $\hat\beta^*$ solves

$$ \min_{\beta \in \mathbb{R}^p}\; \frac{1}{n}\sum_{i=1}^n [g(Y_i, \mathbf{X}_i^T\beta)]^2 + \sum_{j=1}^p P_n(|\beta_j|) $$

using the coordinate descent algorithm as in Fan and Lv (2011).

2. Successively for $k = 1,\dots,p$, let $t^*$ be the minimizer of

$$ \min_t\; \hat L_K(\beta^{(l)}_{(-k)}, t) + P_n'(|\beta^{(l)}_k|)\,|t|. $$

If

$$ L_K(\beta^{(l)}_{(-k)}, t^*) + P_n(|t^*|) < L_K(\beta^{(l)}) + P_n(|\beta^{(l)}_k|), $$

update $\beta^{(l)}_k$ as $t^*$. Increase $l$ by one when $k = p$.

3. Repeat Step 2 until convergence or until $l$ reaches a pre-determined maximum number of
iterations.
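
The overall loop structure can be sketched generically. The code below is a simplified illustration rather than the authors' algorithm: it uses a golden section search for each coordinate (the variant mentioned in the text) instead of the second order approximation (3.6), takes an arbitrary user-supplied objective in place of $L_K$ plus penalty, and starts from a user-supplied point rather than a penalized least squares fit. The names `golden_section`, `coordinate_descent` and the `radius` search window are assumptions of this sketch.

```python
import math

def golden_section(f, lo, hi, tol=1e-8):
    """Minimize a (locally) unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

def coordinate_descent(objective, beta_init, radius=5.0, max_sweeps=100, tol=1e-8):
    """Cyclic coordinate descent that accepts only strict decreases."""
    beta = list(beta_init)
    best = objective(beta)
    for _ in range(max_sweeps):
        start = best
        for k in range(len(beta)):
            def along_k(t, k=k):
                return objective(beta[:k] + [t] + beta[k + 1:])
            t_star = golden_section(along_k, beta[k] - radius, beta[k] + radius)
            # Also try exact zero, since sparsity penalties have a kink there.
            for cand in (t_star, 0.0):
                val = along_k(cand)
                if val < best:           # accept only if the objective strictly decreases
                    beta[k], best = cand, val
        if start - best < tol:
            break
    return beta, best

# Toy separable objective: squared error plus an L1 penalty with weight 1.
def toy_objective(b):
    return 0.5 * (b[0] - 3.0) ** 2 + 0.5 * (b[1] - 0.2) ** 2 + abs(b[0]) + abs(b[1])

beta_hat, value = coordinate_descent(toy_objective, [1.0, 1.0])
print(beta_hat, value)
```

For the toy objective the exact minimizer is the soft-thresholded point $(2, 0)$, which gives a simple check on the loop. On a non-convex objective such as the smoothed FGMM criterion, the same loop would only be expected to reach a local minimizer, so the choice of starting point matters.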

When the second order approximation (3.6) is combined with SCAD in Step 2, the local
linear approximation of SCAD is not needed. As demonstrated in Fan and Li (2001), when
$P_n(t)$ is defined using SCAD, a penalized optimization of the form
$\min_{\beta \in \mathbb{R}} \frac{1}{2}(z - \beta)^2 + \lambda P_n(|\beta|)$ has an analytical solution.
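
For reference, the closed-form SCAD solution of Fan and Li (2001) can be coded in a few lines. The sketch below is stated for the objective $\frac{1}{2}(z - \beta)^2 + p_\lambda(|\beta|)$, where $p_\lambda$ is the SCAD penalty with shape parameter $a > 2$; the function name and the default $a = 3.7$ (the value suggested by Fan and Li) are choices made here, and any extra scaling of the penalty would have to be absorbed into $\lambda$.

```python
def scad_threshold(z, lam, a=3.7):
    """Minimizer of 0.5 * (z - b)**2 + p_lam(|b|) for the SCAD penalty p_lam."""
    sign = 1.0 if z >= 0 else -1.0
    az = abs(z)
    if az <= 2.0 * lam:
        # Soft-thresholding region.
        return sign * max(az - lam, 0.0)
    if az <= a * lam:
        # Transition region between soft thresholding and no shrinkage.
        return ((a - 1.0) * z - sign * a * lam) / (a - 2.0)
    # Large inputs are left unshrunk.
    return z

for z in (0.5, 1.5, 3.0, 5.0):
    print(z, scad_threshold(z, lam=1.0))
```

Unlike soft thresholding, the rule leaves large inputs untouched, which is what removes the bias on strong signals.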
4 Oracle Property of Penalized Regression for Ultra High Dimensional Models

FGMM involves a non-smooth loss function. We first need to develop a general
asymptotic theory for ultra-high-dimensional models to accommodate this. Sufficient conditions for
the oracle property are given when both the loss and penalty functions take general forms.
Then in Section 5, the general theory will be applied to the newly proposed FGMM.



4.1 Penalty function 

Fan and Li (2001) and Lv and Fan (2009) proposed a class of penalty functions that
satisfy a set of general regularity conditions for variable selection consistency. In this
paper, we consider a similar class of penalty functions.

For any $\beta = (\beta_1,\dots,\beta_s)^T \in \mathbb{R}^s$ with $\beta_j \neq 0$, $j = 1,\dots,s$, define

$$ \eta(\beta) = \limsup_{\epsilon \to 0^+}\; \max_{j \le s}\; \sup_{\substack{t_1 < t_2 \\ (t_1, t_2) \in (|\beta_j| - \epsilon,\, |\beta_j| + \epsilon)}} -\frac{P_n'(t_2) - P_n'(t_1)}{t_2 - t_1}, $$

which is $\max_{j \le s} -P_n''(|\beta_j|)$ if the second derivative of $P_n$ is continuous. Let

$$ d_n = \frac{1}{2}\min\{|\beta_{0j}| : \beta_{0j} \neq 0,\ j = 1,\dots,p\} $$

represent the strength of the signals.

We now define a class of penalty functions to be used throughout the paper:

Assumption 4.1. The penalty function $P_n(t) : [0, \infty) \to \mathbb{R}$ satisfies:

(i) $P_n(0) = 0$.

(ii) $P_n(t)$ is concave and increasing on $[0, \infty)$, and has a continuous derivative $P_n'(t)$ when $t > 0$.

(iii) $\sqrt{s}\,P_n'(d_n) = o(d_n)$.

(iv) There exists $c > 0$ such that $\sup_{\beta \in B(\beta_{0S},\, c d_n)} \eta(\beta) = o(1)$.

The concavity of $P_n(\cdot)$ implies that $\eta(\beta) \ge 0$ for all $\beta \in \mathbb{R}^s$. These conditions are
standard, and are needed for establishing the oracle properties of the penalized optimization. It
is straightforward to check that with properly chosen tuning parameters, the $l_q$ penalty (for
$q \le 1$), hard-thresholding (Antoniadis 1996), SCAD (Fan and Li 2001), and MCP (Zhang
2010) all satisfy these conditions.
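
As a concrete instance of these conditions, the derivative of the SCAD penalty of Fan and Li (2001) can be inspected directly. The sketch below is illustrative; the function name and the default $a = 3.7$ are choices made here. It shows that the derivative is non-increasing (the penalty is concave) and vanishes for arguments beyond $a\lambda$, so condition (iii) holds trivially once the signal strength $d_n$ exceeds $a\lambda_n$.

```python
def scad_derivative(t, lam, a=3.7):
    """First derivative P'(t) of the SCAD penalty for t > 0."""
    if t <= lam:
        return lam                           # behaves like the Lasso near zero
    return max(a * lam - t, 0.0) / (a - 1.0)  # decays linearly, then stays at zero

lam = 1.0
for t in (0.5, 1.0, 2.0, 3.7, 5.0):
    print(t, scad_derivative(t, lam))
```

The flat region at zero is what distinguishes SCAD from the Lasso, whose derivative stays at $\lambda$ for all $t > 0$ and therefore requires $\sqrt{s}\,\lambda_n = o(d_n)$ for condition (iii).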






4.2 Oracle property of general penalized regression 

The following theorems provide sufficient conditions for the penalized regression (GMM, 
maximum likelihood, least squares, etc.) to have oracle properties in ultra high dimension. 

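Define $S = \{j \in \{1, \ldots, p\} : \beta_{0j} \neq 0\}$, and $\mathcal{B} = \{\beta \in \mathbb{R}^p : \beta_j = 0 \text{ if } j \notin S\}$. The variable selection aims to recover $S$ with high probability. Our first theorem restricts the penalized optimization onto the $s$-dimensional subspace $\mathcal{B}$, which is the oracle parameter space. Though infeasible in practice, it gives us an idea of the oracle rate. 

In the theorems below, write $L_n(\beta_S, 0) = L_n(\beta)$ for $\beta = (\beta_S^T, 0)^T \in \mathcal{B}$. Let $\beta_S = (\beta_{S1}, \ldots, \beta_{Ss})$ and 

$$\nabla_S L_n(\beta_S, 0) = \left(\frac{\partial L_n(\beta_S, 0)}{\partial \beta_{S1}}, \ldots, \frac{\partial L_n(\beta_S, 0)}{\partial \beta_{Ss}}\right)^T.$$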

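Theorem 4.1 (Oracle Consistency). Suppose $d_n = O(1)$, $s/\sqrt{n} = o(d_n)$ and Assumption 4.1 is satisfied. In addition, suppose $L_n(\beta_S, 0)$ is twice differentiable with respect to $\beta_S$ in a neighborhood of $\beta_{0S}$ restricted on the subspace $\mathcal{B}$, and there exist a positive sequence $\{a_n\}_{n=1}^{\infty}$ such that $a_n/d_n \to 0$, and a constant $c > 0$ such that: 

(i) $\|\nabla_S L_n(\beta_{0S}, 0)\| = O_p(a_n)$. 

(ii) The Hessian matrix $\nabla_S^2 L_n(\beta_S, 0)$ is element-wise continuous within a neighborhood of $\beta_{0S}$, and with probability approaching one, 

$$\lambda_{\min}\left(\nabla_S^2 L_n(\beta_{0S}, 0)\right) > c.$$

Then there exists a strict local minimizer $(\hat\beta_S, 0)$ of 

$$Q_n(\beta_S, 0) = L_n(\beta_S, 0) + \sum_{j \in S} P_n(|\beta_j|)$$

subject to $(\beta_S^T, 0)^T \in \mathcal{B}$ such that 

$$\|\hat\beta_S - \beta_{0S}\| = O_p\left(a_n + \sqrt{s}\, P_n'(d_n)\right).$$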

For a penalized regression estimator, the rate of convergence depends on both $\|\nabla_S L_n(\beta_{0S}, 0)\|$ and the penalty $P_n$. Condition (i) requires that the score function should be asymptotically unbiased, whose rate is usually the leading term of the rate of convergence of the estimator. Condition (ii) ensures that asymptotically the Hessian matrix of $L_n(\beta_S, 0)$ is positive definite in a neighborhood of $\beta_{0S}$. Both conditions are satisfied by the likelihood-type loss function considered in Fan and Lv (2011) and Bradic, Fan and Wang (2011). It will be shown in the next section that FGMM can achieve the near-oracle rate $O_p(\sqrt{(s \log s)/n})$. 

The previous theorem assumes that the true support $S$ is known, which is not practical. We therefore need to derive the conditions under which $S$ can be recovered from the data with probability approaching one. This can be done by demonstrating that the local minimizer of $Q_n$ restricted on $\mathcal{B}$ is also a local minimizer on $\mathbb{R}^p$. The following theorem establishes the sparsity recovery (variable selection consistency) of the estimator, defined as a local solution to a penalized regression problem on $\mathbb{R}^p$. 

For any $\beta \in \mathbb{R}^p$, define the projection function 

$$T\beta = (\beta_1', \ldots, \beta_p')^T \in \mathcal{B}, \qquad \beta_j' = \begin{cases} \beta_j & \text{if } j \in S, \\ 0 & \text{if } j \notin S. \end{cases}$$
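
The projection simply zeroes out the coordinates outside the support. A minimal sketch, using 0-based indices and a hypothetical helper name of our own:

```python
def project_onto_oracle_space(beta, support):
    """Return T(beta): keep beta_j for j in the support S, set the rest to 0."""
    return [b if j in support else 0.0 for j, b in enumerate(beta)]


# Example: p = 4 with support S = {0, 2} (0-based indexing).
gamma = [1.5, -0.2, 3.0, 0.7]
projected = project_onto_oracle_space(gamma, {0, 2})
```

Applying the map twice gives the same vector as applying it once, so the oracle space is exactly the set of vectors with $\beta = T\beta$.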



Theorem 4.2 (Sparsity recovery). Suppose $L_n : \mathbb{R}^p \to \mathbb{R}$ satisfies the conditions in Theorem 4.1, and Assumption 4.1 holds. In addition, for $\hat\beta_S$ in Theorem 4.1, there exists a neighborhood $\mathcal{N}_1 \subset \mathbb{R}^p$ of $(\hat\beta_S^T, 0)^T$, such that for all $\gamma \in \mathcal{N}_1 \setminus \mathcal{B}$, with probability approaching one, 

$$L_n(T\gamma) - L_n(\gamma) < \sum_{j \notin S} P_n(|\gamma_j|). \qquad (4.2)$$

Then with probability approaching 1, $(\hat\beta_S^T, 0)^T$ is a strict local minimizer of 

$$Q_n(\beta) = L_n(\beta) + \sum_{j=1}^{p} P_n(|\beta_j|)$$

in $\mathbb{R}^p$. In particular, if $L_n$ is twice differentiable in a neighborhood of $\beta_0$, then (4.2) holds with probability approaching one, if $\sqrt{s}\,(a_n + \sqrt{s}\, P_n'(d_n)) = o(P_n'(0^+))$, 

$$\max_{l \notin S} \left|\frac{\partial L_n(\beta_0)}{\partial \beta_l}\right| = o_p(P_n'(0^+)), \quad \text{and} \quad \max_{l \le p,\, j \le p} \left|\frac{\partial^2 L_n(\beta_0)}{\partial \beta_l \partial \beta_j}\right| = O_p(1), \qquad (4.3)$$

where we denote $P_n'(0^+) = \liminf_{t \to 0^+} P_n'(t)$. 

Condition (4.2) is a high-level condition; it is close to being the proof of the theorem itself. It is imposed here because we want to allow $L_n(\beta)$ to be possibly nonsmooth, which is often seen in quantile regression (Belloni and Chernozhukov 2011b), and in our proposed FGMM. On the other hand, if $L_n(\beta)$ is assumed to be twice differentiable, such a high-level condition can be verified, and a sufficient condition (4.3) is provided. 

For statistical inference, we have the following theorem on the asymptotic normality. Let 
sgn(-) denote the sign function. 



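Theorem 4.3 (Asymptotic normality). Suppose the assumptions in Theorem 4.1 hold, and there exists an $s \times s$ matrix $\Omega_n$, such that: 

(i) For any unit vector $\alpha \in \mathbb{R}^s$, $\|\alpha\| = 1$, 

$$\alpha^T \Omega_n \nabla_S L_n(\beta_{0S}, 0) \to^d N(0, 1).$$

(ii) 

$$\left\| \Omega_n \left( P_n'(|\hat\beta_{S1}|)\,\mathrm{sgn}(\hat\beta_{S1}), \ldots, P_n'(|\hat\beta_{Ss}|)\,\mathrm{sgn}(\hat\beta_{Ss}) \right)^T \right\| = o_p(1).$$

Then for any unit vector $\alpha \in \mathbb{R}^s$ with $\|\alpha\| = 1$, 

$$\alpha^T \Omega_n \nabla_S^2 L_n(\beta_{0S}, 0)(\hat\beta_S - \beta_{0S}) \to^d N(0, 1).$$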
Therefore, the combination of the above theorems implies that, under the conditions of Theorems 4.1-4.3, $Q_n(\beta)$ has a strict local minimizer in $\mathbb{R}^p$ that can be partitioned as $\hat\beta = (\hat\beta_S^T, \hat\beta_N^T)^T$, where the coordinates of $\hat\beta_S$ are inside $S$, such that 

$$\|\hat\beta_S - \beta_{0S}\| = O_p\left(a_n + \sqrt{s}\, P_n'(d_n)\right), \qquad \lim_{n \to \infty} P(\hat\beta_N = 0) = 1,$$

and in addition, $\hat\beta_S$ is asymptotically normal. 

These sufficient conditions for the variable selection and parameter estimation are very general and not limited to any specific model. We will see in the next section that, with mild regularity conditions on the moments, all the conditions in Theorems 4.1, 4.2 and 4.3 are satisfied by the penalized FGMM in conditional moment restricted models. 



5 Oracle Property of FGMM 

With the help of general penalized regression theory, we are now ready to derive the 
oracle property of the penalized FGMM procedure. The following assumptions are imposed. 

Assumption 5.1. (i) The true parameter $\beta_0$ is uniquely identified by $E(g(Y, X^T\beta_0) \mid X_S) = 0$. 

(ii) $(Y_1, X_1), \ldots, (Y_n, X_n)$ are independent and identically distributed. 

Assumption 5.2. There exist $b_1, b_2 > 0$ and $r_1, r_2 > 0$ such that for any $t > 0$, 

(i) $P(|g(Y, X^T\beta_0)| > t) \le \exp(-(t/b_1)^{r_1})$, 

(ii) $\max_{l \le p} P(|X_l| > t) \le \exp(-(t/b_2)^{r_2})$. 

(iii) $\min_{l \in S} \mathrm{var}(g(Y, X^T\beta_0) X_l)$ is bounded away from zero. 

(iv) $\mathrm{var}(X_l)$ and $\mathrm{var}(X_l^2)$ are bounded away from both zero and infinity uniformly in $l = 1, \ldots, p$ and $p \ge 1$. 

This assumption requires that both the regression residuals and the important regressors have exponential tails, which enables us to apply large deviation theory to show $\|n^{-1} \sum_{i=1}^{n} g(Y_i, X_i^T\beta_0) V_{iS}\| = O_p(\sqrt{s \log s / n})$. A simple example in which this assumption is satisfied is that $g(Y, X^T\beta_0)$ and $X_S$ are Gaussian. 

We will assume $g(\cdot, \cdot)$ to be twice differentiable, and in the following assumptions, let 

$$m(t_1, t_2) = \frac{\partial g(t_1, t_2)}{\partial t_2}, \qquad q(t_1, t_2) = \frac{\partial^2 g(t_1, t_2)}{\partial t_2^2}.$$

Assumption 5.3. $g(\cdot, \cdot)$ is twice differentiable, $\sup_{t_1, t_2} |m(t_1, t_2)| < \infty$, and $\sup_{t_1, t_2} |q(t_1, t_2)| < \infty$. 

This assumption is satisfied by the simple linear regression, logistic regression, probit model, and most of the interesting examples in the generalized linear model. 

Example 5.1. In linear regression, $m(t_1, t_2) = -1$. In logistic regression, $|m(t_1, t_2)| = \frac{\exp(t_2)}{(1 + \exp(t_2))^2} \le 1$ and $|q(t_1, t_2)| = \left|\frac{\exp(t_2)(1 - \exp(t_2))}{(1 + \exp(t_2))^3}\right| \le 1$. In probit regression, $|m(t_1, t_2)| = \phi(t_2) \le (2\pi)^{-1/2}$ and $|q(t_1, t_2)| = |t_2 \phi(t_2)| \le (2\pi e)^{-1/2}$. 
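
The bounds in Example 5.1 can be checked numerically on a grid. The sketch below assumes the usual residual forms, $g(t_1, t_2) = t_1 - \exp(t_2)/(1 + \exp(t_2))$ for logistic regression and $g(t_1, t_2) = t_1 - \Phi(t_2)$ for the probit model; the function names are ours.

```python
import math


def logistic_abs_m(t):
    """|dg/dt2| for the logistic residual: p(1 - p) with p = logistic cdf."""
    p = 1.0 / (1.0 + math.exp(-t))
    return p * (1.0 - p)


def logistic_abs_q(t):
    """|d^2 g/dt2^2| for the logistic residual: |p(1 - p)(1 - 2p)|."""
    p = 1.0 / (1.0 + math.exp(-t))
    return abs(p * (1.0 - p) * (1.0 - 2.0 * p))


def normal_pdf(t):
    """Standard normal density phi(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def probit_abs_m(t):
    """|dg/dt2| for the probit residual: phi(t)."""
    return normal_pdf(t)


def probit_abs_q(t):
    """|d^2 g/dt2^2| for the probit residual: |t phi(t)|."""
    return abs(t) * normal_pdf(t)


grid = [-20.0 + 0.01 * k for k in range(4001)]
max_logistic_m = max(logistic_abs_m(t) for t in grid)
max_logistic_q = max(logistic_abs_q(t) for t in grid)
max_probit_m = max(probit_abs_m(t) for t in grid)
max_probit_q = max(probit_abs_q(t) for t in grid)
```

On this grid the logistic derivative bound is far from tight (the maximum of $p(1-p)$ is $1/4$), while the probit bounds are attained at $t_2 = 0$ and $|t_2| = 1$ respectively.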

Assumption 5.4. There exist $C_1 > 0$ and $C_2 > 0$ such that 

$$\lambda_{\max}\left[(E\, m(Y, X_S^T\beta_{0S}) X_S V_S^T)(E\, m(Y, X_S^T\beta_{0S}) X_S V_S^T)^T\right] < C_1,$$

$$\lambda_{\min}\left[(E\, m(Y, X_S^T\beta_{0S}) X_S V_S^T)(E\, m(Y, X_S^T\beta_{0S}) X_S V_S^T)^T\right] > C_2.$$

The first condition is needed for $\hat\beta_S$ to converge at a near oracle rate, that is, $a_n = O_p(\sqrt{(s \log s)/n})$ for $a_n$ in Theorem 4.1. The second condition ensures that the Hessian matrix of $L_{FGMM}(\beta_S, 0)$ is positive definite at $\beta_{0S}$. In the generalized linear model, Assumption 5.4 is satisfied if proper conditions on the design matrices are imposed. For example, in the linear regression model, we assume 

$$C_1 < \lambda_{\min}(E X_S X_S^T) \le \lambda_{\max}(E X_S X_S^T) < C_2,$$

and 

$$C_1 < \lambda_{\min}(E X_S X_S^{2T}\, E X_S^2 X_S^T) \le \lambda_{\max}(E X_S X_S^{2T}\, E X_S^2 X_S^T) < C_2.$$

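In the probit model, Assumption 5.4 holds if 

$$C_1 < \lambda_{\min}\left(E \phi(X_S^T\beta_{0S}) X_S X_S^T\right) \le \lambda_{\max}\left(E \phi(X_S^T\beta_{0S}) X_S X_S^T\right) < C_2$$

and similar inequalities hold for $E \phi(X_S^T\beta_{0S}) X_S X_S^{2T}$, where $\phi(\cdot)$ is the standard normal density function. Conditions in the same spirit are also assumed in Bradic, Fan and Wang (2011, Condition 4), and Fan and Lv (2011, Condition 4). 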

Assumption 5.5. There exist two nonnegative sequences $\kappa_n = O(\sqrt{s})$ and $\eta_n = O(\sqrt{s})$ such that 

$$\max_{l \notin S} \|E\, m(Y, X^T\beta_0) X_l V_S\|^2 = O(\kappa_n^2),$$

$$\max_{l \notin S} \lambda_{\max}\left[E\, m(Y, X^T\beta_0)^2 X_l^2 V_S V_S^T\right] = O(\eta_n^2),$$

and 

$$s \kappa_n \eta_n \left(\sqrt{\log s / n} + P_n'(d_n)\right) = o(P_n'(0^+)).$$

This assumption is needed to satisfy condition (4.2) in Theorem 4.2. For the ordinary linear model, the above assumption is a statement on 

$$\max_{l \notin S} \|E X_l V_S\|, \quad \text{and} \quad \max_{l \notin S} \lambda_{\max}\left[E X_l^2 V_S V_S^T\right],$$

which imposes some restrictions on the correlation between the important and unimportant regressors once the data are centered. In general, the above assumption imposes some restrictions on the order of the weighted covariance. By Assumptions 5.2 and 5.3, the first two equalities hold with $\kappa_n = \eta_n = \sqrt{s}$. Therefore, without the first two requirements in Assumption 5.5, the oracle property in Theorem 5.1 below still holds if $s^2 P_n'(d_n) + s^2 \sqrt{\log s / n} = o(P_n'(0^+))$. This is satisfied by SCAD and MCP if the tuning parameter satisfies $s^2 \sqrt{\log s / n} \ll \lambda_n \ll d_n$, and by the $\ell_q$ penalty ($q < 1$) under a corresponding rate condition on $\lambda_n$ relative to $d_n$. 

On the other hand, when covariates are weakly correlated, we can take $\kappa_n$ and $\eta_n$ of smaller order than the upper bound $\sqrt{s}$. This relaxes the third requirement in Assumption 5.5, and hence the restrictions on the number of important regressors $s$ and the strength of the minimal signal $d_n$. In particular, when $\kappa_n = \eta_n = 1$, our restriction reduces to $s\left(\sqrt{\log s / n} + P_n'(d_n)\right) = o(P_n'(0^+))$. 

Under the foregoing regularity conditions, we can show the oracle property of a local minimizer of the FGMM criterion. 



Theorem 5.1. Suppose $s/\sqrt{n} = o(d_n)$, and $\log p = o(n)$. Under Assumptions 4.1 and 5.1-5.5, there exists a strict local minimizer $\hat\beta = (\hat\beta_S^T, \hat\beta_N^T)^T$ of $Q_{FGMM}(\beta)$ such that: 

(i) 

$$\|\hat\beta_S - \beta_{0S}\| = O_p\left(\sqrt{(s \log s)/n} + \sqrt{s}\, P_n'(d_n)\right),$$

where $\hat\beta_S$ is a subvector of $\hat\beta$ whose coordinates are in $S$, and 

(ii) 

$$\lim_{n \to \infty} P(\hat\beta_N = 0) = 1.$$

Remark 5.1. 1. We only require $X_S$ to be uncorrelated with the error term. In other words, even if some of the components in $X_N$ are endogenous, penalized FGMM can still achieve the variable selection consistency. 

2. The near oracle rate $\|\hat\beta_S - \beta_{0S}\| = O_p(\sqrt{s \log s / n})$ is attained if $P_n'(d_n) = O(\sqrt{\log s / n})$. This is satisfied, for example, by SCAD and MCP if the tuning parameter $\lambda_n = o(d_n)$. 

The asymptotic normality requires an additional assumption as follows. Define 

$$V_0 = \mathrm{var}\left(g(Y, X_S^T\beta_{0S}) V_S\right). \qquad (5.1)$$

Assumption 5.6. (i) For some $c > 0$, $\lambda_{\min}(V_0) > c$. 

(ii) $P_n'(d_n) = o(1/\sqrt{ns})$. 

(iii) There exists $C > 0$ such that $\sup_{\|\beta - \beta_{0S}\| \le C\sqrt{(s \log s)/n}} \eta(\beta) = o((s \log s)^{-1/2})$. 

Conditions (ii) and (iii) are satisfied by the penalty functions SCAD and MCP. For example, for SCAD, $\sup_{\|\beta - \beta_{0S}\| \le C\sqrt{(s \log s)/n}} \eta(\beta) = 0$ when $\lambda_n + \sqrt{s \log s / n} = o(d_n)$. However, they are not satisfied by the $\ell_q$-penalty ($q \in (0, 2)$), or the elastic net (Zou and Hastie (2005)). 

Theorem 5.2 (Asymptotic Normality). Under the conditions in Theorem 5.1 and Assumption 5.6, the penalized FGMM estimator in Theorem 5.1 satisfies 

$$\sqrt{n}\, \alpha^T \Gamma_n^{-1/2} \Sigma_n (\hat\beta_S - \beta_{0S}) \to^d N(0, 1),$$

for any unit vector $\alpha \in \mathbb{R}^s$, $\|\alpha\| = 1$, where 

$$\Gamma_n = 4 A_n W(\beta_0) V_0 W(\beta_0) A_n^T, \qquad \Sigma_n = 2 A_n W(\beta_0) A_n^T, \qquad A_n = E\, m(Y, X^T\beta_0) X_S V_S^T.$$



6 Global minimization 

Theoretical analysis of minimizing a nonconvex criterion function for large p has so far 
focused on the properties of a specific local minimizer (e.g., Lv and Fan (2009), Bradic et al. 
(2011)). A natural question is how close such a local minimizer is to the global 
solution. 

In the GMM literature, when the parameter satisfies a set of moment conditions whose 
dimension is larger than that of the parameter, the parameter is said to be over-identified. 
Relating the over-identification issue to the problem here, we can then show that the local 
minimizer in Theorems 5.1 and 5.2 can also be made nearly global. 

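For a fixed $\delta$, define an $\ell_\infty$ ball centered at $\beta_0$ with radius $\delta$: 

$$\Theta_\delta = \{\beta : |\beta_i - \beta_{0i}| < \delta,\ i = 1, \ldots, p\}.$$

Assumption 6.1 (over-identification). For any $\delta > 0$, there exists $\epsilon > 0$ such that 

$$\lim_{n \to \infty} P\left( \inf_{\beta \notin \Theta_\delta \cup \{0\}} \left\| \frac{1}{n} \sum_{i=1}^{n} g(Y_i, X_i^T\beta) V_i(\beta) \right\| > \epsilon \right) = 1.$$
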

This is a high-level assumption that is, however, hard to avoid in ultra-high dimensional problems. It is the empirical counterpart of (3.5). We now explain the rationale behind this assumption. As in the discussion of Section 3.2, the number of equations in 

$$E[g(Y, X^T\beta) X(\beta)] = 0 \quad \text{and} \quad E[g(Y, X^T\beta) X^2(\beta)] = 0 \qquad (6.1)$$

is twice as much as the number of unknowns (non-vanishing components in $\beta$). As a result, the above simultaneous equations are in general incompatible (that is, have no solution) unless $\beta$ is on the true parameter space $\beta = (\beta_S^T, 0)^T$. In other words, (6.1) has a unique solution $\beta = \beta_0$, and it is reasonable to assume that $\|\frac{1}{n} \sum_{i=1}^{n} g(Y_i, X_i^T\beta) V_i(\beta)\|$ is bounded away from zero whenever $\beta$ is not close to $\beta_0$. 

We impose this assumption on the empirical counterpart instead of the population for technical reasons. Under ultra-high dimensionality, the accumulation of the approximation errors from using the law of large numbers is no longer negligible, and as a result, it is challenging to show that $\|E[g(Y, X^T\beta) V(\beta)]\|$ is close to $\|\frac{1}{n} \sum_{i=1}^{n} g(Y_i, X_i^T\beta) V_i(\beta)\|$ uniformly for high dimensional $\beta$. 

Theorem 6.1. Assume $\max_{j \in S} P_n(|\beta_{0j}|) = o(s^{-1})$. Under Assumption 6.1 and those of Theorem 5.1, the local minimizer $\hat\beta$ in Theorem 5.1 satisfies: for any $\delta > 0$, there exists $\epsilon > 0$ such that 

$$\lim_{n \to \infty} P\left( Q_{FGMM}(\hat\beta) + \epsilon < \inf_{\beta \notin \Theta_\delta \cup \{0\}} Q_{FGMM}(\beta) \right) = 1.$$

Remark 6.1. 1. The result stated in this theorem is near global, in the sense that it excludes the set $\{0\}$ from the searching area because $Q_{FGMM}(0) = 0$ by definition. It is reasonable to believe that $0$ is not close to the true parameter, since we assume there should be at least one important regressor in the model. In addition, our global minimization result is based on an over-identification assumption, which is essentially different from the global minimization theory in the recent high dimensional literature, e.g., Zhang (2010), Bühlmann and van de Geer (2011, ch 9), and Zhang and Zhang (2012). 

2. Assumption 6.1 can be relaxed a bit in that $\epsilon$ is allowed to decay slowly at a certain rate. The lower bound of such a rate is given by Lemma D.2 in the appendix. 

3. Including finitely many transformations of $X$ in $V$ also enables us to achieve the near global minimization if the over-identification assumption is satisfied. 



7 Semi-parametric efficiency 

The results in Sections 5-6 demonstrate that the choice of the instrumental variable 
V(/3) only changes the asymptotic variance of the estimator, but does not affect the variable 
selection consistency or the rate of convergence. Therefore, the specific choice does not 
matter if our focus is just on these properties, but not on the semiparametric efficiency, that 
is, the minimum asymptotic variance of the estimator. 

On the other hand, one can always follow a two-step post-FGMM procedure if the semi- 
parametric efficiency is indeed one of the objectives. In linear regression, this has been 
achieved by Belloni and Chernozhukov (2011a). 

After achieving the oracle properties in Theorem 5.1, we have exactly identified the important regressors with probability approaching one, that is, 

$$\hat S = \{j : \hat\beta_j \neq 0\}, \qquad \hat X_S = (X_j : j \in \hat S), \qquad P(\hat S = S) \to 1.$$

Then the problem of achieving semiparametric efficiency (in the sense of Newey (1990) and Bickel, Klaassen, Ritov, and Wellner (1998)) in a low dimensional model: 

$$E[g(Y, X_S^T\beta_{0S}) \mid X_S] = 0$$

has been well studied in the literature (see, for example, Chamberlain (1987), Newey (1993)). In particular, Newey (1993) showed that the semiparametric efficient estimator of $\beta_{0S}$ can be obtained using GMM with moment condition: 

$$E[g(Y, X_S^T\beta_{0S})\, \sigma(X_S)^{-2} D(X_S)] = 0 \qquad (7.1)$$

where 

$$\sigma(X_S)^2 = E[g(Y, X_S^T\beta_{0S})^2 \mid X_S], \quad \text{and} \quad D(X_S) = E\left[\frac{\partial g(Y, X_S^T\beta_{0S})}{\partial \beta_S} \,\Big|\, X_S\right].$$

For simplicity, we restrict $s = O(1)$, and only consider the nonlinear regression model: 

$$E(Y \mid X_S) = h(X_S^T\beta_{0S})$$

for some known differentiable function $h(\cdot)$. Suppose there exists a consistent estimator $\hat\sigma(X_S)^2$ of $\sigma(X_S)^2$; we then estimate $\beta_{0S}$ by solving 

$$\rho_n(\beta_S) = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - h(\hat X_{iS}^T\beta_S)\right) h'(\hat X_{iS}^T\beta_S)\, \hat\sigma(\hat X_{iS})^{-2} \hat X_{iS} = 0 \qquad (7.2)$$

on a compact set $\Theta \subset \mathbb{R}^s$ in which $\beta_{0S}$ is an interior point, where $h'(\cdot)$ denotes the first derivative of $h(\cdot)$. 

Let $\mathcal{X}$ be the support of $X_S$. 

Assumption 7.1. (i) There exist $C_1 > 0$ and $C_2 > 0$ so that 

$$C_1 < \inf_{x \in \mathcal{X}} \sigma(x)^2 \le \sup_{x \in \mathcal{X}} \sigma(x)^2 < C_2.$$

In addition, there exists $\hat\sigma(x)$ such that 

$$\sup_{x \in \mathcal{X}} |\hat\sigma(x)^2 - \sigma(x)^2| = o_p(1).$$

(ii) Parameter space: $\beta_{0S}$ lies in the interior of a compact set $\Theta \subset \mathbb{R}^s$. 

(iii) $E(\sup_{\beta_S \in \Theta} h(X_S^T\beta_S)^4) < \infty$, $\sup_t |h'(t)| < \infty$, and $\sup_t |h''(t)| < \infty$. 



The existence of a consistent estimator for o"(x)^ can be obtained in many interesting 
examples. 

Example 7.1 (Homoskedasticity). Suppose $Y = h(X_S^T\beta_{0S}) + \epsilon$, where $\epsilon$ and $X_S$ are independent. Then 

$$\sigma(X_S)^2 = E(\epsilon^2 \mid X_S) = \sigma^2,$$

which does not depend on $X_S$, and hence can be consistently estimated by $\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - h(\hat X_{iS}^T \hat\beta_S))^2$. In this case, equations (7.1) and (7.2) do not depend on $\sigma^2$, and (7.2) is simply the normal equations of the ordinary least-squares. 
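
For the linear case $h(t) = t$ under homoskedasticity, the second step therefore amounts to ordinary least squares on the selected regressors. The following is a minimal pure-Python sketch of that post-selection step; the function names and the Gaussian-elimination solver are our own illustrative choices, not the authors' implementation.

```python
def ols(X, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[a] * row[b] for row in X) for b in range(k)]
        + [sum(row[a] * yi for row, yi in zip(X, y))]
        for a in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * z for x, z in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def post_selection_ols(X, y, selected):
    """Refit by OLS on the selected columns; unselected coefficients stay 0."""
    X_sel = [[row[j] for j in selected] for row in X]
    coef = ols(X_sel, y)
    beta = [0.0] * len(X[0])
    for j, c in zip(selected, coef):
        beta[j] = c
    return beta


# Toy check: y depends only on the first two of three columns.
X = [[1.0, 2.0, 9.0], [2.0, 1.0, 9.0], [3.0, 5.0, 9.0], [4.0, 4.0, 1.0]]
y = [2.0 * r[0] - 1.0 * r[1] for r in X]
beta_post = post_selection_ols(X, y, selected=[0, 1])
```

In practice `selected` would be the estimated support obtained from the first-step FGMM fit.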

Example 7.2 (Exponential family). Consider a generalized linear model where the conditional density of $Y$ given $X_S$ belongs to the exponential family 

$$f(Y; X_S, \theta) = c(Y) \exp\left[Y X_S^T\beta_{0S} - b(X_S^T\beta_{0S})\right].$$

Then $\sigma(X_S)^2 = b''(X_S^T\beta_{0S})$, and can be consistently estimated by $b''(\hat X_S^T \hat\beta_S)$. 

Example 7.3 (Nonparametric approach). One can also assume a semi-parametric structure on the functional form of $\sigma(X_S)^2$: 

$$\sigma(X_S)^2 = f(X_S; \theta),$$

where $f(\cdot\,; \theta)$ is a nonparametric function parameterized by $\theta$. We can then estimate $\sigma(X_S)^2$ using a standard semi-parametric method. More generally, we can proceed by a pure nonparametric approach via regressing $[Y - h(\hat X_S^T \hat\beta_S)]^2$ on $\hat X_S$ (see Fan and Yao, 1998). 

Condition (iii) in Assumption 7.1 is a technical assumption. We need the fourth moment 
of $h(\cdot)$ to be uniformly bounded to apply the uniform weak law of large numbers: 



n 

sup \-J2h(^s(3s)'-Ehinf3s)'\ = Op{l). 



(3s^e n .^^ 

For example, in linear regression $h(X_S^T\beta_S) = X_S^T\beta_S$, so by the compactness of $\Theta$, $E(\sup_{\beta_S \in \Theta} h(X_S^T\beta_S)^4) \le C E\|X_S\|^4 < \infty$. For other interesting models in GLM, this condition has been verified by Example 5.1 in Section 5. 

Theorem 7.1. Suppose $s = O(1)$, and Assumption 7.1 and those of Theorem 5.1 hold. Then the estimator solving (7.2) is asymptotically normal, and its asymptotic variance $[E(\sigma(X_S)^{-2} h'(X_S^T\beta_{0S})^2 X_S X_S^T)]^{-1}$ achieves the semi-parametric efficiency bound in Chamberlain (1987). 

8 Monte Carlo Experiments 
8.1 Design 1 

To test the performance of FGMM for variable selection, we simulate from a simple linear model: 

$$Y = X^T\beta_0 + \epsilon,$$

$$(\beta_{01}, \beta_{02}, \beta_{03}, \beta_{04}, \beta_{05}) = (5, -4, 7, -1, 1.5); \quad \beta_{0j} = 0, \text{ for } 6 \le j \le p.$$

The $p$-dimensional vector of regressors $X$ is generated from the following process: 

$$Z = (Z_1, \ldots, Z_p)^T \sim N_p(0, \Sigma), \quad (\Sigma)_{ij} = 0.5^{|i - j|},$$

$$(X_1, \ldots, X_5) = (Z_1, \ldots, Z_5), \quad X_j = (Z_j + 5)(\epsilon + 1), \text{ for } 6 \le j \le p,$$

where $Z$ is independent of $\epsilon$. The unimportant regressors are correlated with both important regressors and the error term. 
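
A minimal sketch of this data generating process using only the Python standard library follows. It assumes a standard normal error (the text only states that the error is normal), and it draws $Z$ through an AR(1) recursion, which reproduces the covariance $0.5^{|i-j|}$; the function names are ours.

```python
import math
import random

BETA_S = [5.0, -4.0, 7.0, -1.0, 1.5]  # nonzero coefficients of Design 1


def simulate_design1(n, p, rng):
    """Draw n observations (X, y, eps) from Design 1 with p regressors."""
    X, y, errors = [], [], []
    for _ in range(n):
        # AR(1) recursion gives Var(Z_j) = 1 and Cov(Z_i, Z_j) = 0.5^|i-j|.
        z = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            z.append(0.5 * z[-1] + math.sqrt(0.75) * rng.gauss(0.0, 1.0))
        eps = rng.gauss(0.0, 1.0)  # assumption: unit error variance
        # First five regressors are exogenous; the rest depend on eps.
        x = z[:5] + [(zj + 5.0) * (eps + 1.0) for zj in z[5:]]
        X.append(x)
        y.append(sum(b * xj for b, xj in zip(BETA_S, x)) + eps)
        errors.append(eps)
    return X, y, errors


def sample_cov(a, b):
    """Sample covariance of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)


rng = random.Random(0)
X, y, errors = simulate_design1(n=5000, p=15, rng=rng)
cov_exogenous = sample_cov([row[0] for row in X], errors)
cov_endogenous = sample_cov([row[5] for row in X], errors)
```

Under these assumptions the population covariance between an unimportant regressor and the error is $E[(Z_j + 5)(\epsilon + 1)\epsilon] = 5$, whereas it is zero for the five important regressors.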

The data contains $n = 200$ i.i.d. copies of $(Y, X)$. PLS and FGMM are carried out separately for comparison. In our simulation we use SCAD with pre-determined tuning parameters $\lambda$ as the penalty function. 

We use the logistic cumulative distribution function with $h = 0.1$ for smoothing: 

$$F(t) = \frac{\exp(t)}{1 + \exp(t)}, \qquad K\left(\frac{\beta_j^2}{h}\right) = 2F\left(\frac{\beta_j^2}{h}\right) - 1.$$
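
The role of the smoothing step can be sketched as follows. We assume here that the indicator of a nonzero coefficient is replaced by $2F(\beta_j^2/h) - 1$ with $F$ the logistic distribution function; this form and the function names are our reading of the construction and should be checked against Section 3.

```python
import math


def logistic_cdf(t):
    """Logistic cumulative distribution function F(t)."""
    return 1.0 / (1.0 + math.exp(-t))


def smoothed_indicator(beta_j, h=0.1):
    """Smooth surrogate for the indicator 1{beta_j != 0} (assumed form)."""
    return 2.0 * logistic_cdf(beta_j ** 2 / h) - 1.0


at_zero = smoothed_indicator(0.0)    # exactly 0 at beta_j = 0
at_small = smoothed_indicator(0.1)   # small coefficient, value near 0
at_large = smoothed_indicator(1.0)   # large coefficient, value near 1
```

A smaller bandwidth $h$ makes the surrogate closer to the indicator but also makes the criterion function steeper near zero.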

There are 100 replications per experiment. Four performance measures are used to compare the methods. The first measure is the mean squared error (MSE$_S$) of the important regressors, determined by the average of $\|\hat\beta_S - \beta_{0S}\|$ over the 100 replications, where $S = \{1, \ldots, 5\}$. The second measure is the average of the MSE of unimportant regressors, denoted by MSE$_N$. The third measure is the number of correctly selected non-zero coefficients, that is, the true positive (TP), and finally, the fourth measure is the number of incorrectly selected coefficients, the false positive (FP). In addition, the standard error over the 100 replications of each measure is also reported. In each simulation, we initiate $\beta^{(0)} = (0, \ldots, 0)^T$, and run a penalized least squares (SCAD($\lambda$)) for $\lambda = 0.01$ to obtain the initial value for the FGMM procedure. The results of the simulation are summarized in Tables 2-4, which compare the performance measures of PLS and FGMM for three values of $p$. 

Table 2: Performance Measures of PLS and FGMM when p = 15 

                           PLS                                     FGMM
             λ = 0.05  λ = 0.1   λ = 0.5   λ = 1     λ = 0.05  λ = 0.1   λ = 0.2   λ = 0.4
MSE_S        0.147     0.138     0.626     1.452     0.193     0.177     0.203     0.953
             (0.055)   (0.052)   (0.306)   (0.320)   (0.066)   (0.067)   (0.061)   (0.241)
MSE_N        0.076     0.062     0.084     0.093     0.010     0.004     0.003     0.004
             (0.023)   (0.014)   (0.013)   (0.017)   (0.026)   (0.014)   (0.015)   (0.017)
TP-Mean      5         5         4.85      3.57      5         5         5         4.55
   Median    5         5         5         4         5         5         5         5
             (0)       (0)       (0.357)   (0.497)   (0)       (0)       (0)       (0.5)
FP-Mean      9.356     8.84      2.7       1.34      0.099     0.090     0.02      0.04
   Median    10        9         3         1
             (0.769)   (0.987)   (1.127)   (0.553)   (0.412)   (0.288)   (0.218)   (0.197)


PLS has non-negligible false positives (FP). The average FP decreases as the magnitude of the penalty parameter increases, but the average MSE increases as well, since larger penalties also incorrectly miss the important regressors. For $\lambda = 1$, the median of true positives is only 4. In contrast, FGMM performs quite well in both selecting the important regressors and correctly eliminating the unimportant regressors. The average MSE of FGMM is only slightly larger than that of PLS when $\lambda = 0.05$ and $0.1$. This is understandable since the FGMM as implemented does not intend to be efficient in estimating parameters. When the correct regressors are selected by the FGMM, since the error distribution is normal, adding the extra $X_S^2$ term to the squared loss makes the parameters inefficiently estimated. A solution to this efficiency issue is the two-stage post-FGMM, in which ordinary least-squares is run again using the variables $\hat X_S$ (because the error is normal; see Section 7). Note that $\lambda = 0.4$ is a large tuning parameter that results in some incorrectly eliminated important regressors, and a larger MSE. 
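
The four performance measures reported in the tables can be computed per replication along the following lines. This is an illustrative sketch with our own function name; it uses the Euclidean norm of the estimation error on each block, matching the description in the text, and treats any nonzero estimate as selected.

```python
import math


def selection_metrics(beta_hat, beta0, tol=0.0):
    """Per-replication error on S, error on N, true positives, false positives."""
    S = [j for j, b in enumerate(beta0) if b != 0]
    N = [j for j, b in enumerate(beta0) if b == 0]
    err_s = math.sqrt(sum((beta_hat[j] - beta0[j]) ** 2 for j in S))
    err_n = math.sqrt(sum(beta_hat[j] ** 2 for j in N))
    tp = sum(1 for j in S if abs(beta_hat[j]) > tol)
    fp = sum(1 for j in N if abs(beta_hat[j]) > tol)
    return {"err_S": err_s, "err_N": err_n, "TP": tp, "FP": fp}


# Toy example: one important regressor missed, one unimportant one selected.
metrics = selection_metrics([5.0, 0.0, 0.3, 0.0], [5.0, -4.0, 0.0, 0.0])
```

Averaging `err_S`, `err_N`, `TP` and `FP` over the replications gives table entries of the kind reported above.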

8.2 Design 2 

Consider the same simple linear model with 

$$(\beta_{01}, \beta_{02}, \beta_{03}, \beta_{04}, \beta_{05}) = (5, -4, 7, -1, 1.5); \quad \beta_{0j} = 0, \text{ for } 6 \le j \le p.$$

Table 3: Performance Measures of PLS and FGMM when p = 50 

                           PLS                                     FGMM
             λ = 0.05  λ = 0.1   λ = 0.5   λ = 1     λ = 0.05  λ = 0.1   λ = 0.2   λ = 0.4
MSE_S        0.145     0.133     0.629     1.417     0.261     0.184     0.194     0.979
             (0.053)   (0.043)   (0.301)   (0.329)   (0.094)   (0.069)   (0.076)   (0.245)
MSE_N        0.126     0.068     0.072     0.095     0.001               0.001     0.003
             (0.035)   (0.016)   (0.016)   (0.019)   (0.010)   (0)       (0.009)   (0.014)
TP-Mean      5         5         4.82      3.63      5         5         5         4.5
   Median    5         5         5         4         5         5         5         4.5
             (0)       (0)       (0.385)   (0.504)   (0)       (0)       (0)       (0.503)
FP-Mean      37.68     35.36     8.84      2.58      0.08                0.02      0.14
   Median    38        35        8         2
             (2.902)   (3.045)   (3.334)   (1.557)   (0.337)   (0)       (0.141)   (0.569)


Table 4: Performance Measures of PLS and FGMM when p = 300 

                           PLS                                     FGMM
             λ = 0.05  λ = 0.1   λ = 0.5   λ = 1     λ = 0.05  λ = 0.1   λ = 0.2   λ = 0.4
MSE_S        0.186     0.159     0.650     1.430     0.274     0.187     0.193     1.009
             (0.073)   (0.054)   (0.304)   (0.310)   (0.086)   (0.102)   (0.123)   (0.276)
MSE_N        0.221     0.107     0.071     0.086     5×10^-?             5×10^-?   0.002
             (0.037)   (0.019)   (0.023)   (0.027)   (0.006)   (0)       (0.005)   (0.010)
TP-Mean      5         5         4.82      3.62      5         5         4.99      4.45
   Median    5         5         5         4         5         5         5         4
             (0)       (0)       (0.384)   (0.487)   (0)       (0)       (0.100)   (0.557)
FP-Mean      227.96    210.47    42.78     7.94      0.11                0.01      0.05
   Median    227       211       42        7
             (10.767)  (11.38)   (11.773)  (5.635)   (0.37)    (0)       (0.10)    (0.330)

The $p$-dimensional vector of regressors $X$ is generated from the following process: 

$$Z = (Z_1, \ldots, Z_p)^T \sim N_p(0, \Sigma), \quad (\Sigma)_{ij} = 0.5^{|i - j|},$$

$$(X_1, \ldots, X_{100}) = (Z_1, \ldots, Z_{100}), \quad X_j = (Z_j + 5)(\epsilon + 1), \text{ for } 101 \le j \le p,$$

where $Z$ is independent of $\epsilon$. Now the first 95 unimportant regressors are exogenous while the rest are endogenous. We run the same FGMM procedure for $n = 200$ and $p = 300$, with an additional post-FGMM step to improve the mean squared error of the estimates. The results are reported in Table 5. We can see that the penalized FGMM still performs quite well when there are both exogenous and endogenous unimportant regressors. In addition, after running the additional post-FGMM step, one achieves a better accuracy of estimation. 



Table 5: Performance Measures of PLS, FGMM and post-FGMM when p = 300 

                     PLS                               FGMM
             λ = 0.1    λ = 0.5    λ = 0.1    post-FGMM  λ = 0.2    post-FGMM
MSE_S        0.278      0.712      0.215      0.190      0.241      0.188
             (0.089)    (0.342)    (0.085)    (0.068)    (0.174)    (0.069)
MSE_N        0.541      0.118      0.018                 0.006
             (0.083)    (0.056)    (0.042)               (0.011)
TP-Mean      5          4.733      5                     4.97
   Median    5          5          5                     5
             (0)        (0.445)    (0)                   (0.171)
FP-Mean      206.26     31.14      3.56                  3.58
   Median    207        31         3                     3
             (13.658)   (9.024)    (2.231)               (2.235)


8.3 Design 3 

To study the sensitivity of our procedure to the minimal non-vanishing signals, we run another set of simulations with the same data generating process as in Design 1, but we change $\beta_4 = -0.5$ and $\beta_5 = 0.1$, and keep all the remaining parameters the same as before. The minimal non-vanishing signal becomes $|\beta_5| = 0.1$, and we run for $p = 50, 300$ and $n = 200$. All the unimportant regressors are endogenous as in Design 1. Table 6 indicates that the minimal signal is so small that it is not as easily distinguishable from the zero coefficients as before. 

Table 6: Performance Measures of FGMM when p = 50, β_4 = -0.5, β_5 = 0.1 

λ            0.001     0.005     0.01      0.05      0.1
MSE_S        0.160     0.155     0.150     0.199     0.277
             (0.050)   (0.047)   (0.055)   (0.051)   (0.163)
MSE_N        0.069     0.074     0.088     0.002     0.003
             (0.017)   (0.016)   (0.028)   (0.011)   (0.014)
TP-Mean      4.61      4.49      4.42      4         3.78
   Median    5         4         4         4         4
             (0.492)   (0.502)   (0.496)   (0)       (0.416)
FP-Mean      15.94     3.96      1.48      0.07      0.07
   Median    16        3         1
             (3.405)   (1.959)   (0.959)   (0.383)   (0.356)



Table 7: Performance Measures of FGMM when p = 300, β_4 = -0.5, β_5 = 0.1 

λ            0.001     0.005     0.01      0.05      0.1
MSE_S        0.174     0.164     0.168     0.211     0.247
             (0.055)   (0.054)   (0.056)   (0.061)   (0.156)
MSE_N        0.107     0.097     0.083     5×10^-?   0.002
             (0.018)   (0.023)   (0.036)   (0.005)   (0.012)
TP-Mean      4.59      4.52      4.28      4.02      3.83
   Median    5         5         4         4         4
             (0.494)   (0.502)   (0.451)   (0.141)   (0.378)
FP-Mean      76.43     7.83      1.4       0.01      0.06
   Median    77        7         1
             (11.19)   (3.613)   (0.985)   (0.1)     (0.371)

9 Conclusion 



Endogeneity arises easily in high-dimensional regression due to a large pool of regres- 
sors. This causes the inconsistency of the penalized least-squares methods and possible false 
scientific discoveries. When there exists an endogenous variable whose true regression coef- 
ficient is zero, the penalized LS does not satisfy the necessary condition of variable selection 
consistency regardless of the penalty function. 

We propose to penalize an FGMM loss function. It is shown that FGMM possesses the 
oracle property. By the assumption of over-identification, one can also achieve the oracle 
property with near global minimization. 

We give sufficient and necessary conditions for a general penalized optimization to achieve 
the consistency for both variable selection and estimation, and apply these results to the 
sparse conditional moment restricted model, which covers a broad range of applications. 

In addition to FGMM, it is also possible to achieve the oracle property using the penalized 
empirical likelihood (PEL) . The empirical likelihood was first proposed by Owen (1988). Since 
it is defined based on estimating equations and moment conditions, it has been an appealing 
alternative to GMM. The PEL criterion function can be constructed in a similar way, whose 
oracle properties can also be achieved. We will leave this for future research. 

The current paper has assumed that the important regressors are exogenous. In some 
applications in social sciences, however, they are possibly endogenous as well. In this case, 
the oracle property should also be achieved with the help of instrumental variables. Recently 
Gautier and Tsybakov (2011) considered a high dimensional instrumental variable approach. 
We will explore this direction in depth in the future. 

A Proofs for Section 2 

Throughout the Appendix, C will denote a generic positive constant that may be different 
in different uses. 

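A.1 Proof of Theorem 2.1 

Proof. When $\hat\beta$ is a local minimizer of $Q_n(\beta)$, by the Karush-Kuhn-Tucker (KKT) condition, for all $l \notin S$, 

$$\frac{\partial L_n(\hat\beta)}{\partial \beta_l} + v_l = 0,$$

where $v_l = P_n'(|\hat\beta_l|)\,\mathrm{sgn}(\hat\beta_l)$ if $\hat\beta_l \neq 0$; $v_l \in [-P_n'(0^+), P_n'(0^+)]$ if $\hat\beta_l = 0$, and we denote $P_n'(0^+) = \lim_{t \to 0^+} P_n'(t)$. By monotonicity of $P_n'(t)$, we have 

$$\left|\frac{\partial L_n(\hat\beta)}{\partial \beta_l}\right| \le P_n'(0^+). \qquad (A.1)$$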



By Taylor expansion and the Cauchy-Schwarz inequality, there is $\tilde\beta$ on the segment joining 
$\hat\beta$ and $\beta_0$ so that 



max 

Z^5 



d(3i 



< max 



OS I 



Since ||/3g — /3osll = Op{l), and due to the condition of the theorem, we have 



max 



dLn{l3) dLni(3, 



df3i 



0. 



Combining the last two labeled results, we conclude that 



(A.2) 



Q.E.D. 



A.2 Proof of Theorem 2.2 

Proof. Let $\{X_{il}\}_{i=1}^{n}$ be the i.i.d. data of $X_l$, where $X_l$ is an endogenous regressor. Note that in penalized LS, $L_n(\beta) = \frac{1}{n} \sum_{i=1}^{n} (Y_i - X_i^T\beta)^2$. Under the theorem assumptions, by the strong law of large numbers, $\frac{\partial L_n(\beta_0)}{\partial \beta_l} = -\frac{2}{n} \sum_{i=1}^{n} X_{il}(Y_i - X_i^T\beta_0) \to -2E(X_l \epsilon)$ almost surely, which does not satisfy the necessary condition of Theorem 2.1. Q.E.D. 
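
The limit in this proof can be illustrated by simulation. The sketch below evaluates the least-squares score at the true parameter, where the residual equals the error, for one exogenous regressor and for one endogenous regressor built as in Design 1. The standard normal error and the function name are our assumptions.

```python
import random


def ls_score_at_truth(n, rng):
    """Return the LS score (-2/n) * sum_i X_il * eps_i for two regressors."""
    total_exog, total_endog = 0.0, 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)            # assumption: N(0, 1) error
        x_exog = rng.gauss(0.0, 1.0)         # independent of eps
        z = rng.gauss(0.0, 1.0)
        x_endog = (z + 5.0) * (eps + 1.0)    # endogenous, as in Design 1
        total_exog += x_exog * eps
        total_endog += x_endog * eps
    return -2.0 * total_exog / n, -2.0 * total_endog / n


score_exog, score_endog = ls_score_at_truth(20000, random.Random(1))
```

Under these assumptions $E(X_l \epsilon) = 5$ for the endogenous regressor, so its score settles near $-10$ instead of vanishing, while the score of the exogenous regressor is close to zero.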
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B Proofs for Section 4 



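B.1 Proof of Theorem 4.1 

Lemma B.1. Under Assumption 4.1 and $s/\sqrt{n} = o(d_n)$, if $\beta = (\beta_1, \ldots, \beta_s)^T$ is such that $\max_{j \le s} |\beta_j - \beta_{0S,j}| \le d_n$, then 

$$\left| \sum_{j=1}^{s} P_n(|\beta_j|) - P_n(|\beta_{0S,j}|) \right| \le \|\beta - \beta_{0S}\| \sqrt{s}\, P_n'(d_n).$$

Proof. By Taylor's expansion, there exists $\beta^*$ lying on the line segment joining $\beta$ and $\beta_{0S}$ such that 

$$\sum_{j=1}^{s} \left( P_n(|\beta_j|) - P_n(|\beta_{0S,j}|) \right) = \left( P_n'(|\beta_1^*|)\,\mathrm{sgn}(\beta_1^*), \ldots, P_n'(|\beta_s^*|)\,\mathrm{sgn}(\beta_s^*) \right)^T (\beta - \beta_{0S}) \le \|\beta - \beta_{0S}\| \sqrt{s} \max_{j \le s} P_n'(|\beta_j^*|).$$

Then 

$$\min\{|\beta_j^*| : j \le s\} \ge \min\{|\beta_{0S,j}| : j \le s\} - \max_{j \le s} |\beta_j^* - \beta_{0S,j}| \ge 2d_n - d_n = d_n.$$

Since $P_n'$ is non-increasing (as $P_n$ is concave), $P_n'(|\beta_j^*|) \le P_n'(d_n)$ for all $j \le s$. Therefore 

$$\left| \sum_{j=1}^{s} P_n(|\beta_j|) - P_n(|\beta_{0S,j}|) \right| \le \|\beta - \beta_{0S}\| \sqrt{s}\, P_n'(d_n). \quad \text{Q.E.D.}$$

Proof of Theorem 4.1 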

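The proof is a generalization of the proof of Theorem 3 in Fan and Lv (2011). Let $k_n = a_n + \sqrt{s}\, P_n'(d_n)$. It is our assumption that $k_n = o(1)$. Write $Q_1(\beta_S) = Q_n(\beta_S, 0)$, and $L_1(\beta_S) = L_n(\beta_S, 0)$. In addition, write 

$$\nabla L_1(\beta_S) = \frac{\partial L_n(\beta_S, 0)}{\partial \beta_S}, \qquad \nabla^2 L_1(\beta_S) = \frac{\partial^2 L_n(\beta_S, 0)}{\partial \beta_S \partial \beta_S^T}.$$

Define $\mathcal{N}_\tau = \{\beta \in \mathbb{R}^s : \|\beta - \beta_{0S}\| \le k_n \tau\}$ for some $\tau > 0$. Let $\partial\mathcal{N}_\tau$ denote the boundary of $\mathcal{N}_\tau$. Now define an event 

$$H_n(\tau) = \left\{ Q_1(\beta_{0S}) < \min_{\beta_S \in \partial\mathcal{N}_\tau} Q_1(\beta_S) \right\}.$$

On the event $H_n(\tau)$, by the continuity of $Q_1$, there exists a local minimizer of $Q_1$ inside $\mathcal{N}_\tau$. Equivalently, there exists a local minimizer $(\hat\beta_S^T, 0)^T$ of $Q_n$ restricted on $\mathcal{B}$ inside $\{\beta = (\beta_S^T, 0)^T : \beta_S \in \mathcal{N}_\tau\}$. Therefore, 

$$P(\|\hat\beta_S - \beta_{0S}\| \le k_n \tau) \ge P(H_n(\tau)).$$

Hence it suffices to show that for all $\epsilon > 0$, there exists $\tau > 0$ so that $P(H_n(\tau)) > 1 - \epsilon$, and that the local minimizer is strict. 

For any $\beta_S \in \partial\mathcal{N}_\tau$, that is, $\|\beta_S - \beta_{0S}\| = k_n \tau$, there exists $\beta^*$ lying on the segment joining $\beta_S$ and $\beta_{0S}$ such that, by Taylor's expansion of $L_1(\beta_S)$: 

$$Q_1(\beta_S) - Q_1(\beta_{0S}) = (\beta_S - \beta_{0S})^T \nabla L_1(\beta_{0S}) + \frac{1}{2} (\beta_S - \beta_{0S})^T \nabla^2 L_1(\beta^*) (\beta_S - \beta_{0S}) + \sum_{j=1}^{s} \left[ P_n(|\beta_{Sj}|) - P_n(|\beta_{0S,j}|) \right].$$

By Condition (i) that $\|\nabla L_1(\beta_{0S})\| = O_p(a_n)$, for any $\epsilon > 0$, there exists $C_1 > 0$, so that 

$$P\left( (\beta_S - \beta_{0S})^T \nabla L_1(\beta_{0S}) \ge -C_1 \|\beta_S - \beta_{0S}\| a_n \right) > 1 - \epsilon. \qquad (B.1)$$

In addition, Condition (ii) yields that there exists $C > 0$ such that w.p.a.1, 

$$(\beta_S - \beta_{0S})^T \nabla^2 L_1(\beta_{0S}) (\beta_S - \beta_{0S}) > C \|\beta_S - \beta_{0S}\|^2.$$

Hence by the continuity of $\nabla^2 L_1(\cdot)$, and that $\|\beta_S - \beta_{0S}\| \to 0$, 

$$(\beta_S - \beta_{0S})^T \nabla^2 L_1(\beta^*) (\beta_S - \beta_{0S}) > \frac{C}{2} \|\beta_S - \beta_{0S}\|^2.$$

By Lemma B.1, $\sum_{j=1}^{s} [P_n(|\beta_{Sj}|) - P_n(|\beta_{0S,j}|)] \ge -\sqrt{s}\, P_n'(d_n) \|\beta_S - \beta_{0S}\|$. Hence we can choose $\tau > 0$ large enough (for example, $\tau C/4 > \max\{1, C_1\}$) so that, on the event in (B.1), we have: 

$$\min_{\beta \in \partial\mathcal{N}_\tau} Q_1(\beta) - Q_1(\beta_{0S}) \ge k_n \tau \left( \frac{k_n \tau C}{4} - C_1 a_n - \sqrt{s}\, P_n'(d_n) \right) > 0.$$

By (B.1), $P(H_n(\tau)) \ge 1 - \epsilon$. 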

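It remains to show that the local minimizer in $\mathcal{N}_\tau$ (denoted by $\hat\beta_S$) is strict. For each
$h \in \mathbb{R}\setminus\{0\}$, define

$$\psi(h) = \limsup_{\epsilon\to 0+}\ \sup_{t_1 < t_2,\ (t_1,t_2)\in(|h|-\epsilon,\,|h|+\epsilon)} -\frac{P_n'(t_2) - P_n'(t_1)}{t_2 - t_1}.$$

By the concavity of $P_n(\cdot)$, $\psi(\cdot) \ge 0$. We know that $L_1$ is twice differentiable on $\mathbb{R}^s$. For
$\beta_S \in \mathcal{N}_\tau$, let

$$A(\beta_S) = \nabla^2 L_1(\beta_S) - \mathrm{diag}\{\psi(\beta_{S1}), \ldots, \psi(\beta_{Ss})\}.$$

Since $\|\hat\beta_S - \beta_{0S}\| = o_p(1)$, by Condition (ii), there exists $C > 0$ such that for any non-vanishing
$\alpha \in \mathbb{R}^s$, with probability approaching one,

$$\alpha^T\nabla^2 L_1(\hat\beta_S)\alpha > C\|\alpha\|^2.$$

By assumption $k_n = o(d_n)$, hence $\|\hat\beta_S - \beta_{0S}\| \le d_n$ w.p.a.1. By the definition of $\eta(\cdot)$, w.p.a.1,

$$\max_{j\le s}\psi(\hat\beta_{Sj}) \le \eta(\hat\beta_S) \le \sup\eta(\beta).$$

Therefore,

$$P\big(\alpha^T A(\hat\beta_S)\alpha > \|\alpha\|^2(C - \sup\eta(\beta))\big) \to 1,$$

which implies $\alpha^T A(\hat\beta_S)\alpha > C\|\alpha\|^2/2$ w.p.a.1 by Assumption 4.1. Therefore $A(\hat\beta_S)$ is positive
definite w.p.a.1. Q.E.D.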
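The local-concavity quantity used in the strictness argument can be evaluated numerically for a concrete penalty. The sketch below assumes the SCAD penalty with a = 3.7 and uses a single symmetric finite difference as a proxy for the limsup-sup in the definition; this proxy is adequate only away from the kinks of P', and the function names are illustrative.

```python
def scad_deriv(t, lam, a=3.7):
    """Derivative P'(t) of the SCAD penalty for t > 0 (assumed penalty)."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)

def local_concavity(h, lam, a=3.7, eps=1e-6):
    """Finite-difference proxy for -(P'(t2) - P'(t1)) / (t2 - t1) near |h|."""
    t1, t2 = abs(h) - eps, abs(h) + eps
    return -(scad_deriv(t2, lam, a) - scad_deriv(t1, lam, a)) / (t2 - t1)

lam = 0.3
psi_small = local_concavity(0.1, lam)  # linear part of SCAD: no curvature
psi_mid = local_concavity(0.5, lam)    # quadratic part: curvature 1 / (a - 1)
psi_large = local_concavity(2.0, lam)  # flat part beyond a * lam: no curvature
print(psi_small, psi_mid, psi_large)
```

When every signal exceeds a·λ by more than d_n, the curvature term vanishes near the true coefficients, so under this assumed penalty A reduces to the Hessian of L_1 there.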



B.2 Proof of Theorem 4.2

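Proof. Let $\hat\beta = (\hat\beta_S^T, 0)^T$, with $\hat\beta_S \in \mathcal{N}_\tau$ being a strict local minimizer of $Q_1(\beta_S)$, as in the
proof of Theorem 4.1. It remains to prove that $\hat\beta$ is indeed a strict local minimizer of $Q_n(\beta)$
on the space $\mathbb{R}^p$. To show this, take a sufficiently small ball $\mathcal{N}_1$ in $\mathbb{R}^p$ centered at $\hat\beta$ such that

$$\mathcal{N}_1 \cap \mathcal{B} \subset \{(\beta_S^T, 0)^T : \beta_S \in \mathcal{N}_\tau\}. \quad (B.2)$$

We recall the definition

$$\mathcal{B} = \{\beta \in \mathbb{R}^p : \beta_j = 0 \text{ if } \beta_{0j} = 0\},$$

which is $\{\beta = T\beta\}$. We then need to show that $\forall\gamma \in \mathcal{N}_1\setminus\{\hat\beta\}$, $Q_n(\hat\beta) < Q_n(\gamma)$ w.p.a.1.
Note that if $\gamma = (\gamma_S^T, \gamma_N^T)^T$ with $\gamma_N = 0$, then $\gamma \in \mathcal{B}$ and by Theorem 4.1, $Q_n(\hat\beta) < Q_n(\gamma)$.
Therefore we consider the case when $\gamma_N \ne 0$. In addition, note that $Q_n(\hat\beta) \le Q_n(T\gamma)$, where
$T\gamma = (\gamma_S^T, 0)^T$ is the projection of $\gamma$ onto $\mathcal{B}$. Thus, it suffices to show:

Claim: There exists a sufficiently small $\mathcal{N}_1$ satisfying (B.2) such that $\forall\gamma \in \mathcal{N}_1$ with
$\gamma_N \ne 0$, $Q_n(T\gamma) < Q_n(\gamma)$ w.p.a.1.

In fact, this is implied by Condition (4.2):

$$Q_n(T\gamma) - Q_n(\gamma) = L_n(T\gamma) - L_n(\gamma) - \Big(\sum_{j=1}^p P_n(|\gamma_j|) - \sum_{j=1}^p P_n(|(T\gamma)_j|)\Big) < 0.$$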

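If $L_n$ is continuously differentiable in a neighborhood of $\beta_0$, by the mean value theorem,
there exists $\lambda \in (0, 1)$ such that for $h = \lambda\gamma + (1 - \lambda)T\gamma$,

$$Q_n(T\gamma) - Q_n(\gamma) = \sum_{l\notin S}\frac{\partial L_n(h)}{\partial\beta_l}(-\gamma_l) - \sum_{l\notin S}P_n'(|h_l|)|\gamma_l| \le \sum_{l\notin S}\Big[\Big|\frac{\partial L_n(h)}{\partial\beta_l}\Big| - P_n'(|h_l|)\Big]|\gamma_l|,$$

where we used $dP_n(|t|)/dt = P_n'(|t|)\mathrm{sgn}(t)$, and the fact that $\mathrm{sgn}(h_l) = \mathrm{sgn}(\gamma_l)$ for $l \notin S$. It
thus suffices to show that the following holds w.p.a.1:

$$\max_{l\notin S}\Big[\Big|\frac{\partial L_n(h)}{\partial\beta_l}\Big| - P_n'(|h_l|)\Big] < 0.$$

Suppose we have

$$\max_{l\notin S}\Big|\frac{\partial L_n(\hat\beta)}{\partial\beta_l}\Big| = o_p(P_n'(0+)); \quad (B.3)$$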

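then by continuity, there is $\delta > 0$ such that, for any $\beta$ in a ball in $\mathbb{R}^p$ centered at $\hat\beta$ with radius $\delta$,

$$\max_{l\notin S}\Big|\frac{\partial L_n(\beta)}{\partial\beta_l}\Big| - P_n'(\delta) < 0.$$

We further shrink the radius of the ball $\mathcal{N}_1$ to less than $\delta$ so that $|\gamma_j| < \delta$ for any $j \notin S$.
Hence

$$\max_{l\notin S}\Big[\Big|\frac{\partial L_n(h)}{\partial\beta_l}\Big| - P_n'(|h_l|)\Big] = \max_{l\notin S}\Big[\Big|\frac{\partial L_n(h)}{\partial\beta_l}\Big| - P_n'(\lambda|\gamma_l|)\Big] \le \max_{l\notin S}\Big|\frac{\partial L_n(h)}{\partial\beta_l}\Big| - P_n'(\delta) < 0,$$

where we used the monotonicity of $P_n'(\cdot)$. Hence it remains to prove (B.3). By the triangle
inequality,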



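$$\max_{l\notin S}\Big|\frac{\partial L_n(\hat\beta)}{\partial\beta_l}\Big| \le \max_{l\notin S}\Big|\frac{\partial L_n(\hat\beta)}{\partial\beta_l} - \frac{\partial L_n(\beta_0)}{\partial\beta_l}\Big| + \max_{l\notin S}\Big|\frac{\partial L_n(\beta_0)}{\partial\beta_l}\Big|.$$

By assumption, $\max_{l\notin S}|\partial L_n(\beta_0)/\partial\beta_l| = o_p(P_n'(0+))$. For the first term on the right hand side,
apply the mean value theorem (note that $\hat\beta$ and $\beta_0$ only differ at the coordinates in $S$),

$$\max_{l\notin S}\Big|\frac{\partial L_n(\hat\beta)}{\partial\beta_l} - \frac{\partial L_n(\beta_0)}{\partial\beta_l}\Big| \le \max_{l\notin S}\Big|\sum_{j\in S}\frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\,\partial\beta_j}(\hat\beta_j - \beta_{0j})\Big| \le \max_{l\notin S,\, j\in S}\Big|\frac{\partial^2 L_n(\tilde\beta)}{\partial\beta_l\,\partial\beta_j}\Big|\sqrt{s}\,\|\hat\beta_S - \beta_{0S}\| = o_p(P_n'(0+)),$$

where $\tilde\beta$ lies on the line segment joining $\hat\beta$ and $\beta_0$, and we used the Cauchy-Schwarz inequality.

Q.E.D.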
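The mechanism behind (B.3) can be seen in a single coordinate: when the loss gradient at an inactive coordinate is smaller in magnitude than P_n'(0+), the penalty dominates and zero is a strict local minimizer. The sketch below is a toy under stated assumptions: the loss is replaced by its linearization grad·b, the penalty is SCAD with a = 3.7 (so P'(0+) = λ), and all names are hypothetical.

```python
def scad(t, lam, a=3.7):
    """SCAD penalty P(t) for t >= 0 (assumed penalty, for illustration only)."""
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

def toy_objective(b, grad, lam):
    """Q(b) = grad * b + P(|b|): linearized loss plus penalty in one coordinate."""
    return grad * b + scad(abs(b), lam)

def zero_is_strict_local_min(grad, lam, radius, steps=200):
    """Check Q(b) > Q(0) on a grid of nonzero b with |b| <= radius."""
    grid = [radius * k / steps for k in range(1, steps + 1)]
    base = toy_objective(0.0, grad, lam)
    return all(toy_objective(sign * b, grad, lam) > base
               for b in grid for sign in (-1.0, 1.0))

lam = 0.3
small_gradient = zero_is_strict_local_min(grad=0.2, lam=lam, radius=lam)  # |grad| < P'(0+)
large_gradient = zero_is_strict_local_min(grad=0.5, lam=lam, radius=lam)  # |grad| > P'(0+)
print(small_gradient, large_gradient)
```

With the larger gradient, moving the coordinate in the direction opposite to the gradient lowers the objective, which is exactly the situation (B.3) rules out.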



B.3 Proof of Theorem 4.3

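Proof. The KKT condition of $\hat\beta_S$ gives

$$-P_n'(|\hat\beta_S|) \circ \mathrm{sgn}(\hat\beta_S) = \nabla_S L_n(\hat\beta_S, 0),$$

where $\circ$ denotes the Hadamard product of two vectors. By the mean value theorem, there
exists $\beta^*$ lying on the segment joining $\beta_{0S}$ and $\hat\beta_S$ such that

$$\nabla_S L_n(\hat\beta_S, 0) = \nabla_S L_n(\beta_{0S}, 0) + \nabla_S^2 L_n(\beta^*, 0)(\hat\beta_S - \beta_{0S}).$$

Since $\|\hat\beta_S - \beta_{0S}\| = o_p(1)$, we have $\nabla_S^2 L_n(\beta^*, 0) = \nabla_S^2 L_n(\beta_{0S}, 0) + o_p(1)$, where $o_p(1)$ is in
terms of the Frobenius norm. Therefore,

$$(\nabla_S^2 L_n(\beta_{0S}, 0) + o_p(1))(\hat\beta_S - \beta_{0S}) = -P_n'(|\hat\beta_S|) \circ \mathrm{sgn}(\hat\beta_S) - \nabla_S L_n(\beta_{0S}, 0). \quad (B.4)$$

For any unit vector $\alpha \in \mathbb{R}^s$, by Condition (ii), $\|\alpha^T\Omega_n[P_n'(|\hat\beta_S|) \circ \mathrm{sgn}(\hat\beta_S)]\| = o_p(1)$.
Hence the result follows immediately from (B.4) and Condition (i). Q.E.D.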



C Proofs for Section 5 

According to Theorems 4.1 and 4.2, the minimization of $Q_{FGMM}$ can first be constrained on
$\mathcal{B} = \{\beta \in \mathbb{R}^p : \beta_j = 0 \text{ if } j \notin S\}$, so we consider $L_{GMM}(\beta_S) = L_{FGMM}(\beta_S, 0)$ instead, which is
assumed to be twice differentiable. We then proceed to show by using Theorem 4.1 that $\hat\beta_S$
is a local solution to

$$\min_{\beta_S}\ L_{GMM}(\beta_S) + \sum_{j=1}^s P_n(|\beta_{Sj}|)$$

and that $\|\hat\beta_S - \beta_{0S}\| = o_p(1)$. After that, we use Theorem 4.2 to conclude that $(\hat\beta_S^T, 0)^T$ is
also a local solution to $\min_{\beta\in\mathbb{R}^p} Q_{FGMM}(\beta)$.

Throughout the proof, we write X^^ = X.^{f3Qg) and \is = (X^,X^J)-^. 

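C.1 Lemmas

Lemma C.1. (i) $\max_{j\le p}\big|\frac{1}{n}\sum_{i=1}^n (X_{ij} - \bar X_j)^2 - \mathrm{var}(X_j)\big| = o_p(1)$.
(ii) $\max_{j\le p}\big|\frac{1}{n}\sum_{i=1}^n (X_{ij}^2 - \overline{X_j^2})^2 - \mathrm{var}(X_j^2)\big| = o_p(1)$.
(iii) $\sup_{\beta\in\mathbb{R}^p}\lambda_{\max}(\mathbf{W}(\beta)) = O_p(1)$, and $\lambda_{\min}(\mathbf{W}(\beta_0))$ is bounded away from zero w.p.a.1.

Proof. Parts (i) and (ii) follow from an application of the standard large deviation theory,
using the Bernstein inequality and Bonferroni's method. Part (iii) follows from the assumption
that $\mathrm{var}(X_j)$ and $\mathrm{var}(X_j^2)$ are bounded uniformly in $j \le p$.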

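Lemma C.2. If $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{A} - \mathbf{B}$ are all semi-positive definite, then $\lambda_{\max}(\mathbf{A}) \ge \lambda_{\max}(\mathbf{B})$.
Proof. Let $\alpha$ be the eigenvector of $\mathbf{B}$ corresponding to the largest eigenvalue, with $\|\alpha\| = 1$. Then

$$\lambda_{\max}(\mathbf{A}) - \lambda_{\max}(\mathbf{B}) = \lambda_{\max}(\mathbf{A}) - \alpha^T\mathbf{B}\alpha = \lambda_{\max}(\mathbf{A}) + \alpha^T(\mathbf{A} - \mathbf{B})\alpha - \alpha^T\mathbf{A}\alpha \ge \lambda_{\max}(\mathbf{A}) - \alpha^T\mathbf{A}\alpha \ge 0.$$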
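Lemma C.2 is easy to sanity-check on small matrices. The sketch below draws random 2×2 semi-positive definite matrices B and D, sets A = B + D so that A − B is semi-positive definite by construction, and compares largest eigenvalues using the closed form for symmetric 2×2 matrices. The helper names and the random design are illustrative assumptions.

```python
import math
import random

def lam_max_2x2(m):
    """Largest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    return (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)

def random_psd_2x2(rng):
    """Return G G^T for a random 2x2 matrix G, which is semi-positive definite."""
    g = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    return [[sum(g[i][k] * g[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

rng = random.Random(1)
ok = True
for _ in range(1000):
    B = random_psd_2x2(rng)
    D = random_psd_2x2(rng)  # plays the role of A - B
    A = [[B[i][j] + D[i][j] for j in range(2)] for i in range(2)]
    ok = ok and lam_max_2x2(A) >= lam_max_2x2(B) - 1e-12
print(ok)
```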

Lemma C.3. max,,c. \\^Z^^^m{Y,,Xj f3o)X,,V,s\\l = 0,{vl, + ^). 

Proof. Note that the Bernstein inequality plus Bonferroni's method imply that 

1 " 

max \\-^m{Yi, Xf /3o)Xij Vi^Hs 



< max ||Em(r„ Xj f3,)X,Ys\\2 + Op{ 

Since $E\,m(Y, \mathbf{X}^T\beta_0)^2 X_l^2\mathbf{V}_S\mathbf{V}_S^T - E\,m(Y, \mathbf{X}^T\beta_0)X_l\mathbf{V}_S\,E\,m(Y, \mathbf{X}^T\beta_0)X_l\mathbf{V}_S^T$ is semi-positive
definite, by Lemma C.2 and Assumption 5.5,

\\Em{Y,X^f3o)X,Ys\\l < \r.UEm{Y,X' f3,fX^Y s^l) = 0{rii). 
C.2 Proof of Theorem 5.1
C.2.1 Consistency

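For any $\beta \in \mathbb{R}^p$, we can write $T\beta = (\beta_S^T, 0)^T$. Define

$$L_{GMM}(\beta_S) = \Big[\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_S)\mathbf{V}_{iS}\Big]^T\mathbf{W}(\beta_0)\Big[\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_S)\mathbf{V}_{iS}\Big].$$

Then $L_{GMM}(\beta_S) = L_{FGMM}(\beta_S, 0)$. We proceed by verifying the conditions in Theorem 4.1.
Condition (i):

$\nabla L_{GMM}(\beta_{0S}) = 2\mathbf{A}_n(\beta_{0S})\mathbf{W}(\beta_0)\big[\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})\mathbf{V}_{iS}\big]$, where

$$\mathbf{A}_n(\beta_S) = \frac{1}{n}\sum_{i=1}^n m(Y_i, \mathbf{X}_{iS}^T\beta_S)\mathbf{X}_{iS}\mathbf{V}_{iS}^T. \quad (C.1)$$

By Assumption 5.4, $\|\mathbf{A}_n(\beta_{0S})\|_2 = O_p(1)$. In addition, the elements in $\mathbf{W}(\beta_0)$ are uniformly
bounded in probability due to Lemma C.1. Hence

$$\|\nabla L_{GMM}(\beta_{0S})\| \le O_p(1)\Big\|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})\mathbf{V}_{iS}\Big\|.$$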



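Due to $E g(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{X}_S = E g(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{X}_S^2 = 0$, using the exponential-tail Bernstein
inequality with Assumption 5.2 plus Bonferroni's method, it can be shown that for any
$t > 0$,

$$P\Big(\max_{l\in S}\Big|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})X_{il}\Big| > t\Big) \le s\max_{l\in S}P\Big(\Big|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})X_{il}\Big| > t\Big),$$

and each probability on the right is exponentially small in $n$ by the Bernstein inequality,
which implies that

$$\max_{l\in S}\Big|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})X_{il}\Big| = O_p\big(\sqrt{(\log s)/n}\big). \quad (C.2)$$

Similarly,

$$\max_{l\in S}\Big|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})X_{il}^2\Big| = O_p\big(\sqrt{(\log s)/n}\big). \quad (C.3)$$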
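The rate in (C.2) and (C.3) comes from a union bound over s coordinates combined with an exponential tail bound for each sample mean. The simulation below is a rough sketch under simplifying assumptions: the summands are iid Uniform[-1, 1] draws standing in for the mean-zero terms g·X_il, and Hoeffding's inequality for bounded variables replaces the exponential-tail Bernstein inequality, so the constant 4 in the comparison is illustrative only.

```python
import math
import random

def max_abs_coordinate_mean(n, s, rng):
    """Max over s coordinates of |sample mean| of n iid Uniform[-1, 1] draws."""
    worst = 0.0
    for _ in range(s):
        mean = sum(rng.uniform(-1.0, 1.0) for _ in range(n)) / n
        worst = max(worst, abs(mean))
    return worst

rng = random.Random(0)
n, s = 2000, 50
rate = math.sqrt(math.log(s) / n)  # the sqrt(log s / n) rate in (C.2)-(C.3)
max_abs_mean = max_abs_coordinate_mean(n, s, rng)
ratio = max_abs_mean / rate        # stays bounded as n and s vary
print(max_abs_mean, rate, ratio)
```

Under these assumptions Hoeffding plus the union bound gives P(max > 4·rate) ≤ 2·s^(-7), so the maximum exceeds four times the rate only with negligible probability.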



Hence $\|\nabla L_{GMM}(\beta_{0S})\| = O_p\big(\sqrt{(s\log s)/n}\big)$.
Condition (ii): Straightforward but tedious calculation yields $\nabla^2 L_{GMM}(\beta_{0S}) = \Sigma(\beta_{0S}) +
\mathbf{M}(\beta_{0S})$, where

$$\Sigma(\beta_{0S}) = 2\mathbf{A}_n(\beta_{0S})\mathbf{W}(\beta_0)\mathbf{A}_n(\beta_{0S})^T,$$

and

$$\mathbf{M}(\beta_{0S}) = 2\mathbf{H}(\beta_{0S})\mathbf{B}(\beta_{0S}),$$

with (suppose Xis = [Xu^, 



1 " 
1=1 



T 



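$$\mathbf{B}(\beta_{0S}) = \mathbf{W}(\beta_0)\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})\mathbf{V}_{iS}.$$

It is not hard to obtain $\|\mathbf{B}(\beta_{0S})\| = O_p\big(\sqrt{s\log s/n}\big)$, and $\|\mathbf{H}(\beta_{0S})\| = O_p(s)$, and hence
$\|\mathbf{M}(\beta_{0S})\| = O_p\big(s\sqrt{s\log s/n}\big) = o_p(1)$. Therefore, the eigenvalues of $\nabla^2 L_{GMM}(\beta_{0S})$ are
bounded away from zero w.p.a.1.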

C.2.2 Sparsity 

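To show the sparsity, we check (4.2) in Theorem 4.2.
For some neighborhood $\mathcal{N}$ of $(\hat\beta_S^T, 0)^T$, and $\forall\gamma \in \mathcal{N}$, write

$$\gamma = (\gamma_S^T, \gamma_N^T)^T, \quad\text{and}\quad T\gamma = (\gamma_S^T, 0)^T.$$

In addition, we write $\mathbf{V}_i(\gamma_S) = \mathbf{V}_i(T\gamma)$, $\mathbf{V}_i(\gamma_N) = \mathbf{V}_i(\gamma - T\gamma)$, and $\mathbf{W}(\gamma_S) = \mathbf{W}(T\gamma)$ for
notational simplicity.
For all $\theta \in \mathbb{R}^p$, define

$$F(\theta) = \Big[\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\theta)\mathbf{V}_i(\gamma_S)\Big]^T\mathbf{W}(\gamma_S)\Big[\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\theta)\mathbf{V}_i(\gamma_S)\Big].$$

Hence $L_{FGMM}(T\gamma) = F(T\gamma)$, and $L_{FGMM}(\gamma) = F(\gamma) + G(\gamma)$, where

$$G(\gamma) = \Big[\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\gamma)\mathbf{V}_i(\gamma_N)\Big]^T\mathbf{W}(\gamma_N)\Big[\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\gamma)\mathbf{V}_i(\gamma_N)\Big] \ge 0.$$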

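Hence

$$L_{FGMM}(T\gamma) - L_{FGMM}(\gamma) \le F(T\gamma) - F(\gamma).$$

Note that $T\gamma - \gamma = (0, -\gamma_N^T)^T$. By the mean value theorem, there exists $\lambda \in (0, 1)$ such that, for
$h = (\gamma_S^T, \lambda\gamma_N^T)^T$,

$$F(T\gamma) - F(\gamma) - \sum_{l=1}^p \big(P_n(|\gamma_l|) - P_n(|(T\gamma)_l|)\big) = -\sum_{l\notin S}\gamma_l a_l(h) - \sum_{l\notin S}|\gamma_l|P_n'(\lambda|\gamma_l|) \le \sum_{l\notin S}\big(|\gamma_l a_l(h)| - |\gamma_l|P_n'(\lambda|\gamma_l|)\big),$$

where

$$a_l(\theta) = \frac{\partial F(\theta)}{\partial\theta_l} = 2\Big[\frac{1}{n}\sum_{i=1}^n X_{il}\,m(Y_i, \mathbf{X}_i^T\theta)\mathbf{V}_i(\gamma_S)\Big]^T\mathbf{W}(\gamma_S)\Big[\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\theta)\mathbf{V}_i(\gamma_S)\Big].$$

Hence it suffices to show that there exists $\mathcal{N}$ so that for any $\gamma \in \mathcal{N}$,

$$\max_{l\notin S,\ \gamma_l\ne 0}\big(|\gamma_l a_l(h)| - |\gamma_l|P_n'(\lambda|\gamma_l|)\big) < 0. \quad (C.4)$$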



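Suppose we have, for $\hat\beta = (\hat\beta_S^T, 0)^T$,

$$\max_{l\notin S}|a_l(\hat\beta)| = o_p(P_n'(0+)); \quad (C.5)$$

then by continuity, there is $\delta > 0$ such that, for any $\beta$ in a ball in $\mathbb{R}^p$ centered at $\hat\beta$ with radius $\delta$,

$$\max_{l\notin S}|a_l(\beta)| - P_n'(\delta) < 0.$$

We further shrink the radius of $\mathcal{N}$ to less than $\delta$ so that $|\gamma_l| < \delta$ for any $l \notin S$. By the
monotonicity of $P_n'(\cdot)$,

$$\max_{l\notin S}\big(|a_l(h)| - P_n'(\lambda|\gamma_l|)\big) \le \max_{l\notin S}|a_l(h)| - P_n'(\delta) < 0.$$

Hence it remains to prove (C.5). By the triangle inequality,

$$\max_{l\notin S}|a_l(\hat\beta)| \le \max_{l\notin S}|a_l(\hat\beta) - a_l(\beta_0)| + \max_{l\notin S}|a_l(\beta_0)|.$$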



Since X^/3o)|X5) = 0, by Assumption [ESI and (10121) ( fCll) 



inax|ai(/3o)| < 



^ n 1 " 

- ^(^- Xf /3o)X,,V,(75) W(75)- 5^ Xf ;i)V,(75) 

i=l 1=1 
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.(('^n + ^1^^) Vis log s)/n) = o,(i^:(0+)), 



where we used the triangular and Bernstein inequahties to obtain 



max 

Its 



1 " 

- Vm(y„Xf/3o)X,;V,(7c 
n ^-^ 



< max ||Em(y, ^ ^^)XiS s\ 

l^S 



+ max 

lis 



1 " 



0M + 0,(^/^). 



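On the other hand, applying the mean value theorem and the Cauchy-Schwarz inequality
gives (note that $\hat\beta$ and $\beta_0$ only differ at the coordinates in $S$)

$$\max_{l\notin S}|a_l(\hat\beta) - a_l(\beta_0)| \le \max_{l\notin S,\, j\in S}\Big|\frac{\partial a_l(\tilde\beta)}{\partial\beta_j}\Big|\sqrt{s}\,\|\hat\beta_S - \beta_{0S}\| = o_p(P_n'(0+)),$$

where $\tilde\beta$ lies on the line segment joining $\hat\beta$ and $\beta_0$. Note that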



max 

i^sjes 



dai{f3o) 



dp, 



^ n 1 " 

< ||-5^g(F„Xf/3o)X,,X,,VfW(75)-$^^7(F„Xf/3o)V,|| 

1=1 i=l 
n 1 " 

+ 11- Vm(y„ Xf/3o)X,,VfW(75)- Vm(y„ Xf/3o)X,,V, 



i=l 



i=l 



= Op(A/slogs/n + {^Js\ogp/n + k„)(a/s logs/n + ?7„)), 
where in the last equality, we used Lemma IC.3I to bound the second term on the right. 



Therefore, flU3|) holds as long as fi;„r/„s(P^(c/„) + ^logs/n) = o(P;;(0+)). Q.E.D. 



C.3 Proof of Theorem 5.2

Let $P_n'(|\hat\beta_S|) = (P_n'(|\hat\beta_{S1}|), \ldots, P_n'(|\hat\beta_{Ss}|))^T$. The asymptotic normality builds on the following
lemmas.

Lemma C.4. Under Assumption \4-l\ and sj = o{dn), for an, (3^ defined in Theorem \4-l 



$$\|P_n'(|\hat\beta_S|) \circ \mathrm{sgn}(\hat\beta_S)\| = O_p\Big(\max_{\beta\in\mathcal{N}_1}\eta(\beta)\,a_n + \sqrt{s}P_n'(d_n)\Big),$$

where $\mathcal{N}_1 = \{\beta \in \mathbb{R}^s : \|\beta - \beta_{0S}\| \le C\sqrt{(s\log s)/n}\}$, for some $C > 0$, and $\circ$ denotes the
element-wise product.
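Proof. Write

$P_n'(|\hat\beta_S|) \circ \mathrm{sgn}(\hat\beta_S) = (v_1, \ldots, v_s)^T$, where $v_i = P_n'(|\hat\beta_{Si}|)\mathrm{sgn}(\hat\beta_{Si})$.
By the triangle inequality and Taylor expansion,

$$|v_i| \le |P_n'(|\hat\beta_{Si}|) - P_n'(|\beta_{0S,i}|)| + P_n'(|\beta_{0S,i}|) \le \max_{\beta\in\mathcal{N}_1}\eta(\beta)|\hat\beta_{Si} - \beta_{0S,i}| + P_n'(d_n).$$

Therefore,

$$\|P_n'(|\hat\beta_S|) \circ \mathrm{sgn}(\hat\beta_S)\|^2 = \sum_{i=1}^s v_i^2 \le 2\max_{\beta\in\mathcal{N}_1}\eta(\beta)^2\sum_{i=1}^s|\hat\beta_{Si} - \beta_{0S,i}|^2 + 2sP_n'(d_n)^2 = 2\max_{\beta\in\mathcal{N}_1}\eta(\beta)^2\|\hat\beta_S - \beta_{0S}\|^2 + 2sP_n'(d_n)^2,$$

which implies the result since $\|\hat\beta_S - \beta_{0S}\| = O_p(a_n + \sqrt{s}P_n'(d_n))$. Q.E.D.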
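Lemma C.5. Let $\Omega_n = \sqrt{n}\,\Gamma_n^{-1/2}$. Then for any unit vector $\alpha \in \mathbb{R}^s$,

$$\alpha^T\Omega_n\nabla L_{GMM}(\beta_{0S}) \to^d N(0, 1).$$

Proof. $\nabla L_{GMM}(\beta_{0S}) = 2\mathbf{A}_n(\beta_{0S})\mathbf{W}(\beta_0)\mathbf{B}_n$, where

$$\mathbf{B}_n = \frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})\mathbf{V}_{iS}.$$

We write

$\Gamma_n = 4\mathbf{H}\mathbf{W}(\beta_0)\mathbf{V}_0\mathbf{W}(\beta_0)\mathbf{H}^T$, which is $s \times s$;

$\mathbf{V}_0 = \mathrm{var}(\sqrt{n}\mathbf{B}_n) = \mathrm{var}(g(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{V}_S)$, which is $2s \times 2s$;

$\mathbf{H} = E\,m(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{X}_S\mathbf{V}_S^T$, which is $s \times 2s$.

By the weak law of large numbers and the central limit theorem for iid data,

$\|\mathbf{A}_n(\beta_{0S}) - \mathbf{H}\| = o_p(1)$, and $\sqrt{n}\,\tilde\alpha^T\mathbf{V}_0^{-1/2}\mathbf{B}_n \to^d N(0, 1)$
for any unit vector $\tilde\alpha \in \mathbb{R}^{2s}$. Hence by Slutsky's theorem,

$$\sqrt{n}\,\alpha^T\Gamma_n^{-1/2}\nabla L_{GMM}(\beta_{0S}) \to^d N(0, 1).$$

Q.E.D.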

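Note that in the proof of Theorem 5.1, condition (ii), we showed that

$$\nabla^2 L_{GMM}(\beta_{0S}) = \Sigma(\beta_{0S}) + o_p(1),$$

where $o_p(1)$ is in terms of the Frobenius norm. By Theorem 4.3, it remains to check that
for $\Omega_n = \sqrt{n}\,\Gamma_n^{-1/2}$, Condition (ii) in Theorem 4.3 holds. By Assumptions 5.4 and 5.6(i),
$\lambda_{\min}(\Gamma_n)^{-1/2} = O_p(1)$. Lemma C.4 then implies

$$\sqrt{n}\,\lambda_{\min}(\Gamma_n)^{-1/2}\|P_n'(|\hat\beta_S|) \circ \mathrm{sgn}(\hat\beta_S)\| \le C\sqrt{n}\Big(\max_{\beta\in\mathcal{N}_1}\eta(\beta)\sqrt{s\log s/n} + \sqrt{s}P_n'(d_n)\Big) = O_p\Big(\sqrt{s\log s}\,\max_{\beta\in\mathcal{N}_1}\eta(\beta) + \sqrt{ns}\,P_n'(d_n)\Big) = o_p(1).$$

Q.E.D.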



D Proofs for Sections 6 and 7

The local minimizer in Theorem 5.1 is denoted by $\hat\beta = (\hat\beta_S^T, \hat\beta_N^T)^T$, and $P(\hat\beta_N = 0) \to 1$.
Let $\hat\beta_G = (\hat\beta_S^T, 0)^T$.

D.1 Proof of Theorem 6.1
Lemma D.1.

$$L_{FGMM}(\hat\beta_G) = O_p\Big(\frac{s\log s}{n} + sP_n'(d_n)^2\Big).$$

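Proof. We have $L_{FGMM}(\hat\beta_G) \le \|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\hat\beta_S)\mathbf{V}_{iS}\|^2 O_p(1)$. By Taylor expansion, with
some $\tilde\beta$ in the segment joining $\beta_{0S}$ and $\hat\beta_S$,

$$\Big\|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\hat\beta_S)\mathbf{V}_{iS}\Big\| \le \Big\|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\beta_{0S})\mathbf{V}_{iS}\Big\| + \Big\|\frac{1}{n}\sum_{i=1}^n m(Y_i, \mathbf{X}_{iS}^T\tilde\beta)\mathbf{X}_{iS}\mathbf{V}_{iS}^T\Big\|\,\|\hat\beta_S - \beta_{0S}\|$$

$$\le O_p\big(\sqrt{s\log s/n}\big) + \Big\|\frac{1}{n}\sum_{i=1}^n m(Y_i, \mathbf{X}_{iS}^T\beta_{0S})\mathbf{X}_{iS}\mathbf{V}_{iS}^T\Big\|\,\|\hat\beta_S - \beta_{0S}\| + \frac{1}{n}\sum_{i=1}^n|m(Y_i, \mathbf{X}_{iS}^T\tilde\beta) - m(Y_i, \mathbf{X}_{iS}^T\beta_{0S})|\,\|\mathbf{X}_{iS}\mathbf{V}_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|.$$

Note that $\|E\,m(Y, \mathbf{X}_S^T\beta_{0S})\mathbf{X}_S\mathbf{V}_S^T\|_2$ is bounded due to Assumption 5.4. Applying Taylor expansion
again, with some $\tilde\beta^*$, the above term is bounded by

$$O_p\big(\sqrt{s\log s/n}\big) + O_p(1)\|\hat\beta_S - \beta_{0S}\| + \frac{1}{n}\sum_{i=1}^n|q(Y_i, \mathbf{X}_{iS}^T\tilde\beta^*)|\,\|\mathbf{X}_{iS}\|\,\|\tilde\beta - \beta_{0S}\|\,\|\mathbf{X}_{iS}\mathbf{V}_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|.$$

Note that $\sup_{t_1,t_2}|q(t_1, t_2)| < \infty$ by Assumption 5.3. We thus have

$$\frac{1}{n}\sum_{i=1}^n|q(Y_i, \mathbf{X}_{iS}^T\tilde\beta^*)|\,\|\mathbf{X}_{iS}\|\,\|\tilde\beta - \beta_{0S}\|\,\|\mathbf{X}_{iS}\mathbf{V}_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\| \le C\frac{1}{n}\sum_{i=1}^n\|\mathbf{X}_{iS}\|\,\|\mathbf{X}_{iS}\mathbf{V}_{iS}^T\|\,\|\hat\beta_S - \beta_{0S}\|^2 \le C\,E\|\mathbf{X}_S\|\,\|\mathbf{X}_S\mathbf{V}_S^T\|(1 + o_p(1))\|\hat\beta_S - \beta_{0S}\|^2.$$

Combining these terms, we obtain

$$\Big\|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_{iS}^T\hat\beta_S)\mathbf{V}_{iS}\Big\| = O_p\big(\sqrt{s\log s/n} + \sqrt{s}P_n'(d_n)\big) + O_p(s\sqrt{s})\|\hat\beta_S - \beta_{0S}\|^2 = O_p\big(\sqrt{s\log s/n} + \sqrt{s}P_n'(d_n)\big).$$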

Lemma D.2.

$$Q_{FGMM}(\hat\beta_G) = O_p\Big(\frac{s\log s}{n} + sP_n'(d_n)^2 + s\max_{j\in S}P_n(|\beta_{0j}|) + P_n'(d_n)\,s\sqrt{\frac{\log s}{n}}\Big).$$

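Proof. By the foregoing lemma, we have

$$Q_{FGMM}(\hat\beta_G) = O_p\Big(\frac{s\log s}{n} + sP_n'(d_n)^2\Big) + \sum_{j=1}^s P_n(|\hat\beta_{Sj}|).$$

Now, for some $\tilde\beta_{Sj}$ in the segment joining $\hat\beta_{Sj}$ and $\beta_{0S,j}$,

$$\sum_{j=1}^s P_n(|\hat\beta_{Sj}|) \le \sum_{j=1}^s P_n(|\beta_{0S,j}|) + \sum_{j=1}^s P_n'(|\tilde\beta_{Sj}|)|\hat\beta_{Sj} - \beta_{0S,j}| \le s\max_{j\in S}P_n(|\beta_{0j}|) + \sum_{j=1}^s P_n'(d_n)|\hat\beta_{Sj} - \beta_{0S,j}| \le s\max_{j\in S}P_n(|\beta_{0j}|) + P_n'(d_n)\|\hat\beta_S - \beta_{0S}\|\sqrt{s}.$$

The result then follows. Q.E.D.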

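Note that $\forall\delta > 0$,

$$\inf_{\beta\notin\Theta_\delta\cup\{0\}} Q_{FGMM}(\beta) \ge \inf_{\beta\notin\Theta_\delta\cup\{0\}} L_{FGMM}(\beta) \ge \inf_{\beta\notin\Theta_\delta\cup\{0\}}\Big\|\frac{1}{n}\sum_{i=1}^n g(Y_i, \mathbf{X}_i^T\beta)\mathbf{V}_i(\beta)\Big\|^2\min_{j\le p}\{\widehat{\mathrm{var}}(X_j)^{-1}, \widehat{\mathrm{var}}(X_j^2)^{-1}\}.$$

Hence by Assumption 6.1, there exists $\varepsilon > 0$ such that

$$P\Big(\inf_{\beta\notin\Theta_\delta\cup\{0\}} Q_{FGMM}(\beta) > 2\varepsilon\Big) \to 1.$$

On the other hand, by Lemma D.2, $Q_{FGMM}(\hat\beta_G) = o_p(1)$. Therefore,

$$P\Big(Q_{FGMM}(\hat\beta) + \varepsilon > \inf_{\beta\notin\Theta_\delta\cup\{0\}} Q_{FGMM}(\beta)\Big) = P\Big(Q_{FGMM}(\hat\beta_G) + \varepsilon > \inf_{\beta\notin\Theta_\delta\cup\{0\}} Q_{FGMM}(\beta)\Big) + o(1)$$

$$\le P\big(Q_{FGMM}(\hat\beta_G) + \varepsilon > 2\varepsilon\big) + P\Big(\inf_{\beta\notin\Theta_\delta\cup\{0\}} Q_{FGMM}(\beta) < 2\varepsilon\Big) + o(1) \le P\big(Q_{FGMM}(\hat\beta_G) > \varepsilon\big) + o(1) = o(1).$$

Q.E.D.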

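D.2 Proof of Theorem 7.1

Lemma D.3. Define $\rho(\beta_S) = E(Y - h(\mathbf{X}_S^T\beta_S))h'(\mathbf{X}_S^T\beta_{0S})\mathbf{X}_S\sigma(\mathbf{X}_S)^{-2}$. Under the theorem
assumptions,

$$\sup_{\beta_S\in\Theta}\|\rho(\beta_S) - \rho_n(\beta_S)\| = o_p(1).$$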

Proof. Given i?(sup^g@ /i(X^/3)^) < oo and sup^ \h"{t)\ < oo, we have the uniform law of 
large number (Newey and McFadden 1994, Lemma 2.4) 



1 

sup - J2 h"{^ls(^f - Eh"{Xlf3r = 0,(1), 



/3ge n 
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1 " 

sup - J2 K^lf^Y - EhCKlaY = 0,(1). 

Using these, we show three convergence results: 
1 " 

- J2 \\YP^^s{h'{X:[sf3s) - h\X.Js(3,smx 



iS) 



(D.l) 



i=l 



1 " 

sup - J2 \\h{XJs(3s)X,s{h'{XJs/3s) - h'{xJs/3os)M^^sr'\\ = o,{l), (D.2) 
1 " 

sup - \m - h0^ll3s))h'0^ll3,s)^,s{d{^.s)-' - <^rs)-')\\ = Op(l). (D.3) 



i=l 



For (D.1), the left hand side is upper bounded by (for some $\tilde\beta$ in the segment joining $\beta_{0S}$
and $\beta_S$, and applying the Cauchy-Schwarz inequality)



n ^-^ 

i=l 



< 0,(1) 




\ ^=l 



< 0,(1) 0,(1) + snpEh"{^'smPs-M=Op{l), 
V -see 

where in the second inequality, we used the uniform weak law of large numbers. Similarly,
the left hand side of (D.2) is upper bounded by



1 - ^ 

sup - J2 ||MXf5/3s)X.5Xf5/i"(Xp)l| \\f3s - Ma{^^s)-' 

< 0,(1) ( snp -j2\\h{^sPs)^^s^s\n (-j2h"i^s~^r] 

(n n \ 

- Y W^^s^lt sup - Y hO^^sf^sf \\^s- M\ 
^ t=l ^s^® ^ i=l J 

< 0,(1) - J2 W^^s^JsW'Ml) + sup Eh{X^sf3sY) 11/35 - M 

= 0,(1), 

where both the first and second inequalities follow from the Cauchy-Schwarz inequality, and
the third inequality follows from the uniform law of large numbers. (D.3) can be established
in a similar way since $\hat\sigma(\mathbf{X}_S)^2$ uniformly converges to $\sigma(\mathbf{X}_S)^2$.

Due to the previous convergences and that the event $\hat{\mathbf{X}}_S = \mathbf{X}_S$ occurs with probability
approaching one, it remains to show that $\sup_{\beta_S\in\Theta}\|\rho(\beta_S)\| < \infty$ and

1 " 

sup \\-J2^.sh'{XJsf3,s){Y, - hiXJs(3s)MX,s)-' 

-E^sh'i^lfB.sW - hiX.l(3s)MXs)-'\\ = Op(l). 

The above result follows from the uniform law of large number to ^ ^"=1 h{^gf3gY ~ 
Eh{X.^f3g)'^, given that ii^sup^^g@ /^(X^/?^)^ < 00. The fact that sup^^g@ ||p(/35) || < 00 
follows from repeatedly using Cauchy-Schwarz inequality. 
Q.E.D. 

Given the foregoing Lemma D.3, Theorem 7.1 follows from a standard argument for
the asymptotic normality of GMM estimators as in Hansen (1982) and Newey and McFadden
(1994, Theorem 3.4). The asymptotic variance achieves the semiparametric efficiency
bound derived by Chamberlain (1987) and Severini and Tripathi (2001). Therefore, the estimator is
semiparametric efficient.

Q.E.D.